Hermes-3 8B tool-call template leaks tool calls into `content` as text under realistic agent shape (Ollama)

### Description

When `Hermes-3-Llama-3.1-8B` is queried via Ollama's OpenAI-compatible chat-completions endpoint with a **realistic agent request shape** (multi-tool surface + sender-metadata preamble + long system prompt typical of an agent framework), the model's tool calls are **rendered as a stringified JSON blob inside `content` as `type: text`** instead of being returned in the structured `tool_calls` field.

The same Ollama instance on the same host returns textbook-correct structured `tool_calls` for a **single-tool minimal probe** (one tool definition, terse user prompt). The degradation therefore depends on the request shape rather than on the model alone: Hermes-3 itself is emitting the right tokens, but Ollama's per-model template router fails to extract the tool call into the structured field once prompt complexity crosses a threshold.

This is the same class of issue documented for the Qwen family in #13968, #12174, #14493, #15529 and for gemma4 in #15539. Hermes-3 has been widely recommended as the "tool-calling-friendly" Ollama model; this report documents that it is also affected once realistic agent load is applied.

### Expected response shape (single-tool minimal probe — works)
```json
{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "",
      "tool_calls": [{
        "type": "function",
        "function": {"name": "get_current_weather", "arguments": "{...}"}
      }]
    },
    "finish_reason": "tool_calls"
  }]
}
```

### Actual response shape (under realistic agent shape — fails)
```json
{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "{\n \"arguments\": {\n \"message\": \"hello?\"\n },\n \"name\": \"sessions_send\"\n}",
      "tool_calls": null
    },
    "finish_reason": "stop"
  }]
}
```

The OpenAI-compat client (here, an OpenClaw agent gateway) reads `content` as plain text, so no tool is dispatched and no follow-up turn is produced. The user sees raw JSON in their TUI.
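
As a stopgap on the client side, the leaked object can be detected and converted back into a structured call. The following is a minimal sketch, not OpenClaw's or Ollama's actual code: `recover_leaked_tool_call` and `known_tools` are hypothetical names, and it assumes `content` holds nothing but the JSON object with `name` and `arguments` keys, as in the failing response above.

```python
import json
import uuid
from typing import Any, Optional


def recover_leaked_tool_call(
    message: dict[str, Any], known_tools: set[str]
) -> Optional[dict[str, Any]]:
    """Return an OpenAI-style tool_call dict if `content` holds a leaked call, else None."""
    if message.get("tool_calls"):
        return None  # structured path worked; nothing to recover
    content = message.get("content")
    if not isinstance(content, str):
        return None
    try:
        payload = json.loads(content.strip())
    except json.JSONDecodeError:
        return None  # ordinary prose answer
    if not isinstance(payload, dict):
        return None
    name, args = payload.get("name"), payload.get("arguments")
    # Only accept objects that name a tool the caller actually offered.
    if not isinstance(name, str) or name not in known_tools:
        return None
    if not isinstance(args, dict):
        return None
    return {
        "id": f"call_{uuid.uuid4().hex[:8]}",  # synthetic id, assumption of this sketch
        "type": "function",
        "function": {"name": name, "arguments": json.dumps(args)},
    }


if __name__ == "__main__":
    leaked = {
        "role": "assistant",
        "content": '{\n "arguments": {\n "message": "hello?"\n },\n "name": "sessions_send"\n}',
        "tool_calls": None,
    }
    print(recover_leaked_tool_call(leaked, {"sessions_send", "memory_search"}))
```

Restricting recovery to tool names that were actually offered in the request keeps a plain JSON answer from being mistaken for a call. This only masks the symptom; the structured field should still be populated by the server.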


### Reproduction Steps


1. Start Ollama with `Hermes-3-Llama-3.1-8B`:
   ```sh
   docker run -d --name ollama --gpus all -p 11434:11434 \
     -v $HOME/ollama-data:/root/.ollama \
     -e OLLAMA_KEEP_ALIVE=0 \
     ollama/ollama:latest
   docker exec ollama ollama pull hermes3:8b
   ```

2. **Single-tool minimal probe** — confirm the structured tool_calls path works:
   ```sh
   curl -sS http://127.0.0.1:11434/v1/chat/completions \
     -H 'Content-Type: application/json' \
     -d '{
       "model": "hermes3:8b",
       "messages": [{"role": "user", "content": "What is the weather in San Francisco?"}],
       "tools": [{"type":"function","function":{"name":"get_current_weather","parameters":{"type":"object","properties":{"location":{"type":"string"}}}}}]
     }'
   ```
   Result: `tool_calls` populated, `content` empty, `finish_reason: tool_calls`. ✓

3. **Realistic agent shape probe** — same model, multiple tools, sender-metadata preamble, long system prompt (representative of any agent framework like OpenClaw / LangChain / etc.):
   ```sh
   # See attached realistic-agent-probe.json — sender preamble, ~6 tools
   # (sessions_send, memory_search, web_fetch, exec, etc.), multi-paragraph
   # system prompt of the kind the agent gateway sends in production.
   curl -sS http://127.0.0.1:11434/v1/chat/completions \
     -H 'Content-Type: application/json' \
     -d @realistic-agent-probe.json
   ```
   Result: `tool_calls: null`, `content` contains stringified JSON like
   `{"arguments":{...},"name":"sessions_send"}`, `finish_reason: stop`.

The transition between (2) and (3) is gradual: adding tool definitions and lengthening the system prompt progressively destabilizes the parser, and above some threshold extraction fails 100% of the time.
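
To locate that threshold, the request shape can be swept programmatically. This is a minimal sketch under stated assumptions: the tools and system-prompt text are placeholders (the real probe is the attached `realistic-agent-probe.json`), and `make_tool`, `build_probe`, and `classify` are hypothetical helper names, not part of Ollama or OpenClaw.

```python
import json
from typing import Any


def make_tool(name: str) -> dict[str, Any]:
    """Build a placeholder function tool with a single string parameter."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": f"Placeholder tool {name}",
            "parameters": {
                "type": "object",
                "properties": {"input": {"type": "string"}},
            },
        },
    }


def build_probe(n_tools: int, system_paragraphs: int, user_text: str = "hello?") -> dict[str, Any]:
    """Build a chat-completions payload with n_tools tools and a padded system prompt."""
    system = "\n\n".join(
        f"Paragraph {i}: you are an agent; follow the gateway rules."
        for i in range(system_paragraphs)
    )
    messages = [{"role": "user", "content": user_text}]
    if system:
        messages.insert(0, {"role": "system", "content": system})
    return {
        "model": "hermes3:8b",
        "messages": messages,
        "tools": [make_tool(f"tool_{i}") for i in range(n_tools)],
    }


def classify(response: dict[str, Any]) -> str:
    """Label a response as 'structured', 'leaked', or 'text'."""
    message = response["choices"][0]["message"]
    if message.get("tool_calls"):
        return "structured"
    try:
        payload = json.loads(message.get("content") or "")
    except (json.JSONDecodeError, TypeError):
        return "text"
    if isinstance(payload, dict) and "name" in payload and "arguments" in payload:
        return "leaked"
    return "text"


if __name__ == "__main__":
    # Write one payload per step; the step sizes here are arbitrary.
    for n in (1, 2, 4, 6):
        with open(f"probe-{n}-tools.json", "w") as fh:
            json.dump(build_probe(n_tools=n, system_paragraphs=n * 5), fh)
```

Each generated file can be sent with the same `curl ... -d @file` command as in step 3, and the parsed response passed to `classify` to record where `structured` turns into `leaked`.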

### Environment


- Ollama: `ollama/ollama:latest` (Docker, ARM64), GPU runtime via `--gpus all`
- Ollama envs: `OLLAMA_FLASH_ATTENTION=1`, `OLLAMA_KV_CACHE_TYPE=q8_0`, `OLLAMA_NUM_PARALLEL=2`, `OLLAMA_KEEP_ALIVE=0`
- Model: `hermes3:8b` (Hermes-3-Llama-3.1-8B, Q4_K_M GGUF as packaged by Ollama)
- Hardware: NVIDIA GB10 (DGX Spark), 96 GB unified memory; 28 layers offloaded to GPU
- OS: Ubuntu 24.04 aarch64
- Driver / CUDA: NVIDIA 580.142 / CUDA 13.0
- Caller: OpenClaw 2026.4.9 agent gateway via `/v1/chat/completions` (OpenAI-compat), behind an auth-proxy (token check only — payload unmodified)

### Debug Output

```shell

```

### Logs

Captured 2026-04-29 19:17–19:21 UTC from a live agent session (OpenClaw rtfm sandbox, four user prompts). All four assistant messages have `content` as stringified JSON tool-call, no `tool_calls` field, `stopReason: "stop"`:

```shell
{"type":"message","id":"d0ab96a5","timestamp":"2026-04-29T19:17:14.840Z","message":{"role":"assistant","content":[{"type":"text","text":"{\n \"arguments\": {\n \"message\": \"hello?\"\n },\n \"name\": \"sessions_send\"\n}"}],"api":"openai-completions","model":"hermes3:8b","stopReason":"stop","responseId":"chatcmpl-976"}}
{"type":"message","id":"f2e95dc7","timestamp":"2026-04-29T19:20:22.687Z","message":{"role":"assistant","content":[{"type":"text","text":"{\n \"arguments\": {\n \"query\": \"What is 2 + 2?\"\n },\n \"name\": \"memory_search\"\n}"}],"api":"openai-completions","model":"hermes3:8b","stopReason":"stop","responseId":"chatcmpl-326"}}
{"type":"message","id":"7dd727ce","timestamp":"2026-04-29T19:20:55.344Z","message":{"role":"assistant","content":[{"type":"text","text":"{\n \"arguments\": {\n \"query\": \"Tell me one fact about robotics in one sentence.\"\n },\n \"name\": \"memory_search\"\n}"}],"api":"openai-completions","model":"hermes3:8b","stopReason":"stop","responseId":"chatcmpl-678"}}
{"type":"message","id":"af9d6e2a","timestamp":"2026-04-29T19:21:32.981Z","message":{"role":"assistant","content":[{"type":"text","text":"{\n \"arguments\": {\n \"message\": \"ok\"\n },\n \"name\": \"sessions_send\"\n}"}],"api":"openai-completions","model":"hermes3:8b","stopReason":"stop","responseId":"chatcmpl-134"}}
```

Note `content[0].type: "text"` and `text` containing a literal stringified JSON object with `arguments` and `name` keys: the exact shape the Hermes-3 chat template emits, but unwrapped from its tool-call markers and not extracted by Ollama into `tool_calls`.

### Comparison: same prompt shape on vLLM with `--tool-call-parser hermes`

We migrated to `vllm/vllm-openai:latest` with `--enable-auto-tool-choice --tool-call-parser hermes` against the same `Hermes-3-Llama-3.1-8B` weights, and the agent's realistic shape now produces:

- For tool-warranted prompts: structured `tool_calls`, empty `content`, `finish_reason: tool_calls`.
- For non-tool prompts ("What is 2+2?"): `tool_calls: []`, `content: "4"`, `finish_reason: stop`.

This indicates the model is emitting the right tokens; the issue is the per-model parser inside Ollama not extracting them.

### Checklist

- [x] I confirmed this bug is reproducible
- [x] I searched existing issues and this is not a duplicate
