[Bug]: KV cache invalidation on new user message due to message format differences between agentic loop and session reload

## Bug Description

When using a local llama.cpp backend (via lemonade/llama-server) with a single slot (`-np 1`), the KV cache is fully invalidated every time a **new user message** arrives — even though the cache works perfectly during agentic iteration (tool call loops within the same turn).

This causes a full reprocessing of the entire context (~100K tokens, taking ~10 minutes) on every new user message, despite the llama.cpp server having the previous context fully cached.

**Root cause:** The messages sent to the API during the agentic loop differ from the messages sent when the gateway reloads history from the session store for a new user message. This produces different tokenization via the chat template, invalidating the prefix cache.

Three specific differences were identified:

### 1. Extra fields in `tool_calls` 

During the agentic loop, tool_calls sent to the API have standard fields:
```json
{"id": "abc123", "type": "function", "function": {...}}
```

But after session reload, the gateway history loader passes through `call_id` and `response_item_id`:
```json
{"id": "abc123", "call_id": "abc123", "response_item_id": "fc_abc123", "type": "function", "function": {...}}
```

**Location:** `gateway/run.py` ~line 5295-5300 — the "rich agent messages" path passes through all fields except `timestamp`:
```python
if has_tool_calls or has_tool_call_id or is_tool_message:
    clean_msg = {k: v for k, v in msg.items() if k != "timestamp"}
    agent_history.append(clean_msg)
```

This should also strip `call_id` and `response_item_id` which are Hermes-internal fields not part of the OpenAI API spec.

### 2. Content whitespace normalization

During the agentic loop, assistant message content may have trailing newlines (e.g., `"Starting the task.\n\n"` — 90 chars). But the session store saves the content trimmed (88 chars, no trailing `\n\n`). When reloaded, the tokenization differs.

### 3. `reasoning` vs `reasoning_content` field naming

The session transcript stores reasoning as `reasoning`. The gateway history loader preserves this field name:
```python
for _rkey in ("reasoning", "reasoning_details", "codex_reasoning_items"):
    _rval = msg.get(_rkey)
    if _rval:
        entry[_rkey] = _rval
```

But the API call path in `run_agent.py` ~line 5598 converts it to `reasoning_content`:
```python
if reasoning:
    api_msg["reasoning_content"] = reasoning
```

When the chat template processes these messages, the field name difference may affect tokenization on some backends.

## Steps to Reproduce

1. Configure Hermes with a local llama.cpp backend (lemonade-server) running with `-np 1` (single slot)
2. Use a model with the Qwen3.5 chat template (or any template that includes tool_call fields in tokenization)
3. Send a message that triggers tool calls (agentic loop)
4. Wait for the agent to complete its response
5. Send a second message
6. Observe in llama.cpp logs: `n_past = 62` (only ~62 tokens match), forcing full context reprocessing

## Expected Behavior

The second request should reuse the KV cache for the entire prefix (system prompt + conversation history from the first turn). Only the new tokens (assistant response + new user message, ~258 tokens) should need processing.

Expected: `n_past ≈ 100,215` (matching almost all previous tokens)

## Actual Behavior

`n_past = 62` — only the chat template header tokens match. The entire 100K context is reprocessed from scratch (~10 minutes on the test hardware).

```
slot update_slots: id  0 | task 240 | n_past = 62, slot.prompt.tokens.size() = 100449
slot update_slots: id  0 | task 240 | forcing full prompt re-processing due to lack of cache data
slot update_slots: id  0 | task 240 | erased invalidated context checkpoint (pos_min = 32767, ...)
slot update_slots: id  0 | task 240 | erased invalidated context checkpoint (pos_min = 65535, ...)
slot update_slots: id  0 | task 240 | erased invalidated context checkpoint (pos_min = 92018, ...)
slot update_slots: id  0 | task 240 | erased invalidated context checkpoint (pos_min = 100210, ...)
```

## Proposed Fix

1. **Strip Hermes-internal fields from tool_calls on session reload** (`gateway/run.py` ~line 5295):
```python
if has_tool_calls or has_tool_call_id or is_tool_message:
    clean_msg = {k: v for k, v in msg.items() if k not in ("timestamp", "finish_reason")}
    # Strip non-standard tool_call fields
    if "tool_calls" in clean_msg:
        clean_msg["tool_calls"] = [
            {k: v for k, v in tc.items() if k in ("id", "type", "function")}
            for tc in clean_msg["tool_calls"]
        ]
    agent_history.append(clean_msg)
```

2. **Normalize content whitespace** — either preserve exact whitespace in session store, or strip in both paths consistently.

3. **Convert `reasoning` → `reasoning_content`** in the gateway history loader (same transformation as the agentic loop).

The key principle: **the messages sent to the API must be byte-identical regardless of whether they come from in-memory agentic iteration or session store reload.** Any difference — even an extra JSON field — causes the chat template to produce different tokens, invalidating the entire KV cache.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Bug]: KV cache invalidation on new user message due to message format differences between agentic loop and session reload #4555

Bug Description

1. Extra fields in `tool_calls`

2. Content whitespace normalization

3. `reasoning` vs `reasoning_content` field naming

Steps to Reproduce

Expected Behavior

Actual Behavior

Proposed Fix

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

[Bug]: KV cache invalidation on new user message due to message format differences between agentic loop and session reload #4555

Description

Bug Description

1. Extra fields in tool_calls

2. Content whitespace normalization

3. reasoning vs reasoning_content field naming

Steps to Reproduce

Expected Behavior

Actual Behavior

Proposed Fix

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

1. Extra fields in `tool_calls`

3. `reasoning` vs `reasoning_content` field naming