Bug Description
AGENTS.md states a hard policy on prompt caching:
Prompt Caching Must Not Break
Hermes-Agent ensures caching remains valid throughout a conversation. Do NOT implement changes that would:
- Alter past context mid-conversation
- Change toolsets mid-conversation
- Reload memories or rebuild system prompts mid-conversation
Cache-breaking forces dramatically higher costs. The ONLY time we alter context is during context compression.
The Honcho memory provider violates this directly.
Per the Prompt Assembly architecture doc, Layer 3 of the cached system prompt is the "Honcho static block (when active)". Per the Honcho Memory integration doc, that same block is:
- Base context — session summary, user representation, user peer card, AI self-representation, AI identity card. "Refreshed on
contextCadence."
- Dialectic supplement — LLM-synthesized reasoning about the user's current state. "Refreshed on
dialecticCadence."
Every refresh rewrites Layer 3 of the cached prompt mid-conversation, invalidating the entire KV cache on every backend that does prefix caching:
- Local:
llama.cpp, vLLM --enable-prefix-caching, MLX with PagedAttention
- Cloud: Anthropic / OpenAI / Bedrock with explicit cache control
Causal attention means mid-prefix mutations invalidate everything after the first divergence — every token downstream of the Honcho block must be reprocessed from scratch. On a ~20K-token session that is a multi-minute reprocess on every cadence tick.
The same Prompt Assembly doc already classifies "later-turn Honcho recall injected into the current-turn user message" as an API-call-time-only layer (explicitly not persisted to the cached system prompt) — exactly the correct pattern. The auto-injected base / dialectic block just hasn't been moved over.
Steps to Reproduce
- Configure Hermes with
recallMode: \"hybrid\" (or \"context\") and any finite contextCadence / dialecticCadence.
- Point Hermes at a backend with prefix caching enabled (e.g. local
vllm --enable-prefix-caching, llama.cpp, MLX via vLLM-MLX).
- Send the first user message — first-turn full prefill is expected.
- Send N−1 more messages where N =
min(contextCadence, dialecticCadence). Prefix cache hits as expected.
- Send one more message. Observe: full prefill from scratch, not just the new tail tokens.
Cross-check: run the same backend + model under a client that does not mutate the system prompt mid-session (e.g. opencode). Prefix cache hits on every turn, confirming the backend is not at fault.
Expected Behavior
Per the stated policy: the cached system prompt is byte-identical for the duration of a session (compression excepted).
Auto-injected Honcho context rides on the current user message, same path as the existing `_ext_prefetch_cache` injection (`run_agent.py` ~L6737–6740), and is therefore ephemeral and cache-safe.
Actual Behavior
Every `contextCadence` / `dialecticCadence` tick rewrites Layer 3 of the cached system prompt.
Observed on Gemma 4 31B 8-bit (MLX) served via a `vllm-mlx` fork with PagedAttention prefix caching:
| `contextCadence` |
Cache-miss rate |
Per-turn latency on miss |
| `1` |
100 % of turns |
~5 min (≈20K tokens full prefill) |
| `4` |
~25 % of turns |
~5 min on every 4th turn |
With `contextCadence: 1` every user turn is a full prefill. Bumping to 4 reduces the rate to one in four, but the underlying design is still wrong — on any multi-turn session, some fraction of turns will always incur a full prefill regardless of how it is tuned.
Root Cause
`agent/prompt_builder.py` emits the Honcho static block inside `_build_system_prompt()`, whose result is cached on `self._cached_system_prompt`. The Honcho memory provider's `prefetch_all()` result populates this block and is deliberately refreshed on cadence, contradicting the cached-prompt contract.
The correct injection site already exists: `_ext_prefetch_cache` in `run_agent.py` appends per-turn context to the current user message (same file, ~L6737–6740 — also referenced in #5719). Only a subset of Honcho output currently goes through that path.
Proposed Fix
- Remove the Honcho static block from Layer 3 of `_build_system_prompt()` in `agent/prompt_builder.py`.
- Extend `_ext_prefetch_cache` (or an equivalent per-turn slot) in `run_agent.py` to carry the formatted `## Honcho Context` block for the current turn only.
- Document and enforce the invariant: the cached system prompt is byte-stable for the session; any Honcho content that varies turn-to-turn must ride on the current user message.
This is the same architectural move already proposed in #3353 for runtime metadata, and enforces the policy stated in `AGENTS.md`. It subsumes the `contextCadence` / `dialecticCadence` knobs under the same correctness guarantee instead of requiring users to manually trade off memory freshness against prefill cost.
Environment
| Component |
Value |
| Hermes Agent |
v0.10.0 (2026.4.16) — 34 commits behind `main` |
| Python |
3.11.15 |
| OpenAI SDK |
2.31.0 |
| OS |
macOS 26.4.1 (Darwin 25.4.0, `arm64 T6020` / M2 Pro) |
| Client host |
`sparta.local` |
| Inference backend |
Separate M2 Ultra (128 GB) running a custom `mlx-hosting` glue in front of a `vllm-mlx` fork with PagedAttention prefix caching; model Gemma 4 31B 8-bit MLX |
| Honcho |
Self-hosted, `baseUrl: http://127.0.0.1:8000\` |
Relevant Honcho config (`~/.hermes/honcho.json`):
```json
{
"recallMode": "hybrid",
"contextCadence": 4,
"dialecticCadence": 4,
"dialecticDepth": 1,
"dialecticReasoningLevel": "medium",
"observationMode": "directional",
"writeFrequency": "async",
"sessionStrategy": "per-directory"
}
```
Related Issues
Same architectural class — mid-session mutation of cached prompt state:
Adjacent Honcho issues (for completeness, not duplicates of this bug):
Bug Description
AGENTS.mdstates a hard policy on prompt caching:The Honcho memory provider violates this directly.
Per the Prompt Assembly architecture doc, Layer 3 of the cached system prompt is the "Honcho static block (when active)". Per the Honcho Memory integration doc, that same block is:
contextCadence."dialecticCadence."Every refresh rewrites Layer 3 of the cached prompt mid-conversation, invalidating the entire KV cache on every backend that does prefix caching:
llama.cpp,vLLM --enable-prefix-caching, MLX with PagedAttentionCausal attention means mid-prefix mutations invalidate everything after the first divergence — every token downstream of the Honcho block must be reprocessed from scratch. On a ~20K-token session that is a multi-minute reprocess on every cadence tick.
The same Prompt Assembly doc already classifies "later-turn Honcho recall injected into the current-turn user message" as an API-call-time-only layer (explicitly not persisted to the cached system prompt) — exactly the correct pattern. The auto-injected base / dialectic block just hasn't been moved over.
Steps to Reproduce
recallMode: \"hybrid\"(or\"context\") and any finitecontextCadence/dialecticCadence.vllm --enable-prefix-caching,llama.cpp, MLX via vLLM-MLX).min(contextCadence, dialecticCadence). Prefix cache hits as expected.Cross-check: run the same backend + model under a client that does not mutate the system prompt mid-session (e.g.
opencode). Prefix cache hits on every turn, confirming the backend is not at fault.Expected Behavior
Per the stated policy: the cached system prompt is byte-identical for the duration of a session (compression excepted).
Auto-injected Honcho context rides on the current user message, same path as the existing `_ext_prefetch_cache` injection (`run_agent.py` ~L6737–6740), and is therefore ephemeral and cache-safe.
Actual Behavior
Every `contextCadence` / `dialecticCadence` tick rewrites Layer 3 of the cached system prompt.
Observed on Gemma 4 31B 8-bit (MLX) served via a `vllm-mlx` fork with PagedAttention prefix caching:
With `contextCadence: 1` every user turn is a full prefill. Bumping to 4 reduces the rate to one in four, but the underlying design is still wrong — on any multi-turn session, some fraction of turns will always incur a full prefill regardless of how it is tuned.
Root Cause
`agent/prompt_builder.py` emits the Honcho static block inside `_build_system_prompt()`, whose result is cached on `self._cached_system_prompt`. The Honcho memory provider's `prefetch_all()` result populates this block and is deliberately refreshed on cadence, contradicting the cached-prompt contract.
The correct injection site already exists: `_ext_prefetch_cache` in `run_agent.py` appends per-turn context to the current user message (same file, ~L6737–6740 — also referenced in #5719). Only a subset of Honcho output currently goes through that path.
Proposed Fix
This is the same architectural move already proposed in #3353 for runtime metadata, and enforces the policy stated in `AGENTS.md`. It subsumes the `contextCadence` / `dialecticCadence` knobs under the same correctness guarantee instead of requiring users to manually trade off memory freshness against prefill cost.
Environment
Relevant Honcho config (`~/.hermes/honcho.json`):
```json
{
"recallMode": "hybrid",
"contextCadence": 4,
"dialecticCadence": 4,
"dialecticDepth": 1,
"dialecticReasoningLevel": "medium",
"observationMode": "directional",
"writeFrequency": "async",
"sessionStrategy": "per-directory"
}
```
Related Issues
Same architectural class — mid-session mutation of cached prompt state:
Adjacent Honcho issues (for completeness, not duplicates of this bug):