Skip to content

[Bug]: KV cache invalidation on new user message due to message format differences between agentic loop and session reload #4555

@jimmy-claw

Description

@jimmy-claw

Bug Description

When using a local llama.cpp backend (via lemonade/llama-server) with a single slot (-np 1), the KV cache is fully invalidated every time a new user message arrives — even though the cache works perfectly during agentic iteration (tool call loops within the same turn).

This causes a full reprocessing of the entire context (~100K tokens, taking ~10 minutes) on every new user message, despite the llama.cpp server having the previous context fully cached.

Root cause: The messages sent to the API during the agentic loop differ from the messages sent when the gateway reloads history from the session store for a new user message. This produces different tokenization via the chat template, invalidating the prefix cache.

Three specific differences were identified:

1. Extra fields in tool_calls

During the agentic loop, tool_calls sent to the API have standard fields:

{"id": "abc123", "type": "function", "function": {...}}

But after session reload, the gateway history loader passes through call_id and response_item_id:

{"id": "abc123", "call_id": "abc123", "response_item_id": "fc_abc123", "type": "function", "function": {...}}

Location: gateway/run.py ~line 5295-5300 — the "rich agent messages" path passes through all fields except timestamp:

if has_tool_calls or has_tool_call_id or is_tool_message:
    clean_msg = {k: v for k, v in msg.items() if k != "timestamp"}
    agent_history.append(clean_msg)

This should also strip call_id and response_item_id which are Hermes-internal fields not part of the OpenAI API spec.

2. Content whitespace normalization

During the agentic loop, assistant message content may have trailing newlines (e.g., "Starting the task.\n\n" — 90 chars). But the session store saves the content trimmed (88 chars, no trailing \n\n). When reloaded, the tokenization differs.

3. reasoning vs reasoning_content field naming

The session transcript stores reasoning as reasoning. The gateway history loader preserves this field name:

for _rkey in ("reasoning", "reasoning_details", "codex_reasoning_items"):
    _rval = msg.get(_rkey)
    if _rval:
        entry[_rkey] = _rval

But the API call path in run_agent.py ~line 5598 converts it to reasoning_content:

if reasoning:
    api_msg["reasoning_content"] = reasoning

When the chat template processes these messages, the field name difference may affect tokenization on some backends.

Steps to Reproduce

  1. Configure Hermes with a local llama.cpp backend (lemonade-server) running with -np 1 (single slot)
  2. Use a model with the Qwen3.5 chat template (or any template that includes tool_call fields in tokenization)
  3. Send a message that triggers tool calls (agentic loop)
  4. Wait for the agent to complete its response
  5. Send a second message
  6. Observe in llama.cpp logs: n_past = 62 (only ~62 tokens match), forcing full context reprocessing

Expected Behavior

The second request should reuse the KV cache for the entire prefix (system prompt + conversation history from the first turn). Only the new tokens (assistant response + new user message, ~258 tokens) should need processing.

Expected: n_past ≈ 100,215 (matching almost all previous tokens)

Actual Behavior

n_past = 62 — only the chat template header tokens match. The entire 100K context is reprocessed from scratch (~10 minutes on the test hardware).

slot update_slots: id  0 | task 240 | n_past = 62, slot.prompt.tokens.size() = 100449
slot update_slots: id  0 | task 240 | forcing full prompt re-processing due to lack of cache data
slot update_slots: id  0 | task 240 | erased invalidated context checkpoint (pos_min = 32767, ...)
slot update_slots: id  0 | task 240 | erased invalidated context checkpoint (pos_min = 65535, ...)
slot update_slots: id  0 | task 240 | erased invalidated context checkpoint (pos_min = 92018, ...)
slot update_slots: id  0 | task 240 | erased invalidated context checkpoint (pos_min = 100210, ...)

Proposed Fix

  1. Strip Hermes-internal fields from tool_calls on session reload (gateway/run.py ~line 5295):
if has_tool_calls or has_tool_call_id or is_tool_message:
    clean_msg = {k: v for k, v in msg.items() if k not in ("timestamp", "finish_reason")}
    # Strip non-standard tool_call fields
    if "tool_calls" in clean_msg:
        clean_msg["tool_calls"] = [
            {k: v for k, v in tc.items() if k in ("id", "type", "function")}
            for tc in clean_msg["tool_calls"]
        ]
    agent_history.append(clean_msg)
  1. Normalize content whitespace — either preserve exact whitespace in session store, or strip in both paths consistently.

  2. Convert reasoningreasoning_content in the gateway history loader (same transformation as the agentic loop).

The key principle: the messages sent to the API must be byte-identical regardless of whether they come from in-memory agentic iteration or session store reload. Any difference — even an extra JSON field — causes the chat template to produce different tokens, invalidating the entire KV cache.

Metadata

Metadata

Assignees

No one assigned

    Labels

    P2Medium — degraded but workaround existscomp/gatewayGateway runner, session dispatch, deliverysweeper:implemented-on-mainSweeper: behavior already present on current maintype/bugSomething isn't working

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions