Bug Description
When using a local llama.cpp backend (via lemonade/llama-server) with a single slot (-np 1), the KV cache is fully invalidated every time a new user message arrives — even though the cache works perfectly during agentic iteration (tool call loops within the same turn).
This causes a full reprocessing of the entire context (~100K tokens, taking ~10 minutes) on every new user message, despite the llama.cpp server having the previous context fully cached.
Root cause: The messages sent to the API during the agentic loop differ from the messages sent when the gateway reloads history from the session store for a new user message. This produces different tokenization via the chat template, invalidating the prefix cache.
Three specific differences were identified:
1. Extra fields in tool_calls
During the agentic loop, tool_calls sent to the API have standard fields:
{"id": "abc123", "type": "function", "function": {...}}
But after session reload, the gateway history loader passes through call_id and response_item_id:
{"id": "abc123", "call_id": "abc123", "response_item_id": "fc_abc123", "type": "function", "function": {...}}
Location: gateway/run.py ~line 5295-5300 — the "rich agent messages" path passes through all fields except timestamp:
if has_tool_calls or has_tool_call_id or is_tool_message:
clean_msg = {k: v for k, v in msg.items() if k != "timestamp"}
agent_history.append(clean_msg)
This should also strip call_id and response_item_id which are Hermes-internal fields not part of the OpenAI API spec.
2. Content whitespace normalization
During the agentic loop, assistant message content may have trailing newlines (e.g., "Starting the task.\n\n" — 90 chars). But the session store saves the content trimmed (88 chars, no trailing \n\n). When reloaded, the tokenization differs.
3. reasoning vs reasoning_content field naming
The session transcript stores reasoning as reasoning. The gateway history loader preserves this field name:
for _rkey in ("reasoning", "reasoning_details", "codex_reasoning_items"):
_rval = msg.get(_rkey)
if _rval:
entry[_rkey] = _rval
But the API call path in run_agent.py ~line 5598 converts it to reasoning_content:
if reasoning:
api_msg["reasoning_content"] = reasoning
When the chat template processes these messages, the field name difference may affect tokenization on some backends.
Steps to Reproduce
- Configure Hermes with a local llama.cpp backend (lemonade-server) running with
-np 1 (single slot)
- Use a model with the Qwen3.5 chat template (or any template that includes tool_call fields in tokenization)
- Send a message that triggers tool calls (agentic loop)
- Wait for the agent to complete its response
- Send a second message
- Observe in llama.cpp logs:
n_past = 62 (only ~62 tokens match), forcing full context reprocessing
Expected Behavior
The second request should reuse the KV cache for the entire prefix (system prompt + conversation history from the first turn). Only the new tokens (assistant response + new user message, ~258 tokens) should need processing.
Expected: n_past ≈ 100,215 (matching almost all previous tokens)
Actual Behavior
n_past = 62 — only the chat template header tokens match. The entire 100K context is reprocessed from scratch (~10 minutes on the test hardware).
slot update_slots: id 0 | task 240 | n_past = 62, slot.prompt.tokens.size() = 100449
slot update_slots: id 0 | task 240 | forcing full prompt re-processing due to lack of cache data
slot update_slots: id 0 | task 240 | erased invalidated context checkpoint (pos_min = 32767, ...)
slot update_slots: id 0 | task 240 | erased invalidated context checkpoint (pos_min = 65535, ...)
slot update_slots: id 0 | task 240 | erased invalidated context checkpoint (pos_min = 92018, ...)
slot update_slots: id 0 | task 240 | erased invalidated context checkpoint (pos_min = 100210, ...)
Proposed Fix
- Strip Hermes-internal fields from tool_calls on session reload (
gateway/run.py ~line 5295):
if has_tool_calls or has_tool_call_id or is_tool_message:
clean_msg = {k: v for k, v in msg.items() if k not in ("timestamp", "finish_reason")}
# Strip non-standard tool_call fields
if "tool_calls" in clean_msg:
clean_msg["tool_calls"] = [
{k: v for k, v in tc.items() if k in ("id", "type", "function")}
for tc in clean_msg["tool_calls"]
]
agent_history.append(clean_msg)
-
Normalize content whitespace — either preserve exact whitespace in session store, or strip in both paths consistently.
-
Convert reasoning → reasoning_content in the gateway history loader (same transformation as the agentic loop).
The key principle: the messages sent to the API must be byte-identical regardless of whether they come from in-memory agentic iteration or session store reload. Any difference — even an extra JSON field — causes the chat template to produce different tokens, invalidating the entire KV cache.
Bug Description
When using a local llama.cpp backend (via lemonade/llama-server) with a single slot (
-np 1), the KV cache is fully invalidated every time a new user message arrives — even though the cache works perfectly during agentic iteration (tool call loops within the same turn).This causes a full reprocessing of the entire context (~100K tokens, taking ~10 minutes) on every new user message, despite the llama.cpp server having the previous context fully cached.
Root cause: The messages sent to the API during the agentic loop differ from the messages sent when the gateway reloads history from the session store for a new user message. This produces different tokenization via the chat template, invalidating the prefix cache.
Three specific differences were identified:
1. Extra fields in
tool_callsDuring the agentic loop, tool_calls sent to the API have standard fields:
{"id": "abc123", "type": "function", "function": {...}}But after session reload, the gateway history loader passes through
call_idandresponse_item_id:{"id": "abc123", "call_id": "abc123", "response_item_id": "fc_abc123", "type": "function", "function": {...}}Location:
gateway/run.py~line 5295-5300 — the "rich agent messages" path passes through all fields excepttimestamp:This should also strip
call_idandresponse_item_idwhich are Hermes-internal fields not part of the OpenAI API spec.2. Content whitespace normalization
During the agentic loop, assistant message content may have trailing newlines (e.g.,
"Starting the task.\n\n"— 90 chars). But the session store saves the content trimmed (88 chars, no trailing\n\n). When reloaded, the tokenization differs.3.
reasoningvsreasoning_contentfield namingThe session transcript stores reasoning as
reasoning. The gateway history loader preserves this field name:But the API call path in
run_agent.py~line 5598 converts it toreasoning_content:When the chat template processes these messages, the field name difference may affect tokenization on some backends.
Steps to Reproduce
-np 1(single slot)n_past = 62(only ~62 tokens match), forcing full context reprocessingExpected Behavior
The second request should reuse the KV cache for the entire prefix (system prompt + conversation history from the first turn). Only the new tokens (assistant response + new user message, ~258 tokens) should need processing.
Expected:
n_past ≈ 100,215(matching almost all previous tokens)Actual Behavior
n_past = 62— only the chat template header tokens match. The entire 100K context is reprocessed from scratch (~10 minutes on the test hardware).Proposed Fix
gateway/run.py~line 5295):Normalize content whitespace — either preserve exact whitespace in session store, or strip in both paths consistently.
Convert
reasoning→reasoning_contentin the gateway history loader (same transformation as the agentic loop).The key principle: the messages sent to the API must be byte-identical regardless of whether they come from in-memory agentic iteration or session store reload. Any difference — even an extra JSON field — causes the chat template to produce different tokens, invalidating the entire KV cache.