Hermes-3 8B tool-call template leaks tool calls into content as text under realistic agent shape (Ollama) #2731

@camerono

Description

When Hermes-3-Llama-3.1-8B is queried through Ollama's OpenAI-compatible chat-completions endpoint with a realistic agent request shape (multiple tools, a sender-metadata preamble, and the long system prompt typical of an agent framework), the model's tool calls come back as a stringified JSON blob inside content, which the client records as type: text, rather than in the structured tool_calls field.

The same Ollama instance on the same host returns textbook-correct structured tool_calls for a single-tool minimal probe (one tool definition, terse user prompt). The failure therefore depends on the request shape rather than on the model alone: Hermes-3 itself is emitting the right tokens, but Ollama's per-model template router fails to extract the tool call into the structured field once prompt complexity crosses a threshold.

The class of issue is the same one documented for Qwen-family in #13968, #12174, #14493, #15529 and for gemma4 in #15539. Hermes-3 has been widely recommended as the "tool-calling-friendly" Ollama model; this report documents that it is also affected once realistic agent load is applied.

Expected response shape (single-tool minimal probe — works)

{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "",
      "tool_calls": [{
        "type": "function",
        "function": {"name": "get_current_weather", "arguments": "{...}"}
      }]
    },
    "finish_reason": "tool_calls"
  }]
}

Actual response shape (under realistic agent shape — fails)

{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "{\n \"arguments\": {\n \"message\": \"hello?\"\n },\n \"name\": \"sessions_send\"\n}",
      "tool_calls": null
    },
    "finish_reason": "stop"
  }]
}

The OpenAI-compat client (here, an OpenClaw agent gateway) reads content as plain text, so no tool is dispatched and no follow-up turn is produced. The user sees raw JSON in their TUI.
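A client-side guard can detect this failure mode until the server-side extraction is fixed. The following is a minimal sketch, not part of OpenClaw or Ollama: `extract_leaked_tool_call` is a hypothetical helper name, and it assumes the leaked blob is a JSON object with exactly the `name` and `arguments` keys shown in the failing response above.

```python
import json
from typing import Any, Optional


def extract_leaked_tool_call(message: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Return a tool call recovered from `content`, or None.

    Hypothetical helper: only fires when `tool_calls` is empty and `content`
    parses as a JSON object with exactly the keys `name` and `arguments`.
    """
    if message.get("tool_calls"):
        return None  # structured path worked; nothing to recover
    content = message.get("content")
    if not isinstance(content, str):
        return None
    try:
        parsed = json.loads(content)
    except json.JSONDecodeError:
        return None  # ordinary prose, or an empty string
    if not isinstance(parsed, dict) or set(parsed) != {"name", "arguments"}:
        return None
    if not isinstance(parsed["name"], str) or not isinstance(parsed["arguments"], dict):
        return None
    # Re-wrap in the OpenAI tool_calls shape, where `arguments` is a JSON string.
    return {
        "type": "function",
        "function": {
            "name": parsed["name"],
            "arguments": json.dumps(parsed["arguments"]),
        },
    }


# The failing message from this report.
failing = {
    "role": "assistant",
    "content": "{\n \"arguments\": {\n \"message\": \"hello?\"\n },\n \"name\": \"sessions_send\"\n}",
    "tool_calls": None,
}
recovered = extract_leaked_tool_call(failing)
print(recovered)
```

The strict key check is deliberate: a model that legitimately answers with a JSON object should not be mistaken for a tool call, so anything beyond the exact leaked shape is left as plain content.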

Reproduction Steps

  1. Start Ollama with Hermes-3-Llama-3.1-8B:

    docker run -d --name ollama --gpus all -p 11434:11434 \
      -v $HOME/ollama-data:/root/.ollama \
      -e OLLAMA_KEEP_ALIVE=0 \
      ollama/ollama:latest
    docker exec ollama ollama pull hermes3:8b

  2. Single-tool minimal probe — confirm the structured tool_calls path works:

    curl -sS http://127.0.0.1:11434/v1/chat/completions \
      -H 'Content-Type: application/json' \
      -d '{
        "model": "hermes3:8b",
        "messages": [{"role": "user", "content": "What is the weather in San Francisco?"}],
        "tools": [{"type":"function","function":{"name":"get_current_weather","parameters":{"type":"object","properties":{"location":{"type":"string"}}}}}]
      }'

    Result: tool_calls populated, content empty, finish_reason: tool_calls. ✓

  3. Realistic agent shape probe: same model, multiple tools, sender-metadata preamble, long system prompt (representative of agent frameworks such as OpenClaw or LangChain):

    # See attached realistic-agent-probe.json — sender preamble, ~6 tools
    # (sessions_send, memory_search, web_fetch, exec, etc.), multi-paragraph
    # system prompt of the kind the agent gateway sends in production.
    curl -sS http://127.0.0.1:11434/v1/chat/completions \
      -H 'Content-Type: application/json' \
      -d @realistic-agent-probe.json

    Result: tool_calls: null, content contains stringified JSON like
    {"arguments":{...},"name":"sessions_send"}, finish_reason: stop.

The transition between (2) and (3) is gradual — adding tool definitions and lengthening the system prompt progressively destabilizes the parser. Above some threshold it fails 100% of the time.
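One way to locate the threshold is to generate request bodies of increasing size and replay each against the endpoint. The sketch below only builds the bodies and sends nothing. The tool names, descriptions, and system-prompt text are placeholders invented for illustration; they are not the contents of realistic-agent-probe.json.

```python
import json
from typing import Any


def make_tool(name: str) -> dict[str, Any]:
    """Build a placeholder tool definition in the OpenAI `tools` shape."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": f"Placeholder tool {name}",
            "parameters": {
                "type": "object",
                "properties": {"input": {"type": "string"}},
            },
        },
    }


def build_probe(n_tools: int, system_paragraphs: int, user_text: str) -> dict[str, Any]:
    """Build one chat-completions body with the given tool count and prompt length."""
    system_prompt = "\n\n".join(
        f"Placeholder instruction paragraph {i}." for i in range(system_paragraphs)
    )
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_text})
    return {
        "model": "hermes3:8b",
        "messages": messages,
        "tools": [make_tool(f"tool_{i}") for i in range(n_tools)],
    }


# Six steps: 1..6 tools, with the system prompt growing alongside.
sweep = [build_probe(n, n * 2, "Send 'hello?' to the session.") for n in range(1, 7)]
for body in sweep:
    print(len(body["tools"]), "tools,", len(json.dumps(body)), "bytes")
```

Each body can be written to a file and sent with the same `curl -d @file` invocation used in step 3, recording at which step `tool_calls` first comes back null.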

Environment

  • Ollama: ollama/ollama:latest (Docker, ARM64), GPU runtime via --gpus all
  • Ollama envs: OLLAMA_FLASH_ATTENTION=1, OLLAMA_KV_CACHE_TYPE=q8_0, OLLAMA_NUM_PARALLEL=2, OLLAMA_KEEP_ALIVE=0
  • Model: hermes3:8b (Hermes-3-Llama-3.1-8B, Q4_K_M GGUF as packaged by Ollama)
  • Hardware: NVIDIA GB10 (DGX Spark), 96 GB unified memory; 28 layers offloaded to GPU
  • OS: Ubuntu 24.04 aarch64
  • Driver / CUDA: NVIDIA 580.142 / CUDA 13.0
  • Caller: OpenClaw 2026.4.9 agent gateway via /v1/chat/completions (OpenAI-compat), behind an auth-proxy (token check only — payload unmodified)

Debug Output

Logs

Captured 2026-04-29 19:17–19:21 UTC from a live agent session (OpenClaw rtfm sandbox, four user prompts). All four assistant messages have `content` as stringified JSON tool-call, no `tool_calls` field, `stopReason: "stop"`:


{"type":"message","id":"d0ab96a5","timestamp":"2026-04-29T19:17:14.840Z","message":{"role":"assistant","content":[{"type":"text","text":"{\n \"arguments\": {\n \"message\": \"hello?\"\n },\n \"name\": \"sessions_send\"\n}"}],"api":"openai-completions","model":"hermes3:8b","stopReason":"stop","responseId":"chatcmpl-976"}}
{"type":"message","id":"f2e95dc7","timestamp":"2026-04-29T19:20:22.687Z","message":{"role":"assistant","content":[{"type":"text","text":"{\n \"arguments\": {\n \"query\": \"What is 2 + 2?\"\n },\n \"name\": \"memory_search\"\n}"}],"api":"openai-completions","model":"hermes3:8b","stopReason":"stop","responseId":"chatcmpl-326"}}
{"type":"message","id":"7dd727ce","timestamp":"2026-04-29T19:20:55.344Z","message":{"role":"assistant","content":[{"type":"text","text":"{\n \"arguments\": {\n \"query\": \"Tell me one fact about robotics in one sentence.\"\n },\n \"name\": \"memory_search\"\n}"}],"api":"openai-completions","model":"hermes3:8b","stopReason":"stop","responseId":"chatcmpl-678"}}
{"type":"message","id":"af9d6e2a","timestamp":"2026-04-29T19:21:32.981Z","message":{"role":"assistant","content":[{"type":"text","text":"{\n \"arguments\": {\n \"message\": \"ok\"\n },\n \"name\": \"sessions_send\"\n}"}],"api":"openai-completions","model":"hermes3:8b","stopReason":"stop","responseId":"chatcmpl-134"}}


Note `content[0].type: "text"` and `text` containing a literal stringified JSON object with `arguments` and `name` keys — the exact shape the Hermes-3 chat template emits, but unwrapped from its tool-call markers and not extracted by Ollama into `tool_calls`.
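To measure how often this happens across a session, the log can be scanned for assistant messages whose text part is itself a tool-call object. This is a minimal sketch assuming the JSONL record layout shown above; `leaked_tool_name` is a hypothetical helper, and the two sample records are the first two log lines from this report, copied verbatim.

```python
import json
from collections import Counter
from typing import Optional


def leaked_tool_name(log_line: str) -> Optional[str]:
    """Return the tool name if the record's text part is a leaked tool call."""
    record = json.loads(log_line)
    message = record.get("message", {})
    parts = message.get("content")
    if message.get("role") != "assistant" or not isinstance(parts, list):
        return None
    for part in parts:
        if not isinstance(part, dict) or part.get("type") != "text":
            continue
        try:
            candidate = json.loads(part.get("text", ""))
        except json.JSONDecodeError:
            continue  # ordinary prose
        if isinstance(candidate, dict) and {"name", "arguments"} <= set(candidate):
            return candidate["name"]
    return None


# Raw strings keep the JSON escapes (\n, \") exactly as they appear in the log.
LOG = [
    r'{"type":"message","id":"d0ab96a5","timestamp":"2026-04-29T19:17:14.840Z","message":{"role":"assistant","content":[{"type":"text","text":"{\n \"arguments\": {\n \"message\": \"hello?\"\n },\n \"name\": \"sessions_send\"\n}"}],"api":"openai-completions","model":"hermes3:8b","stopReason":"stop","responseId":"chatcmpl-976"}}',
    r'{"type":"message","id":"f2e95dc7","timestamp":"2026-04-29T19:20:22.687Z","message":{"role":"assistant","content":[{"type":"text","text":"{\n \"arguments\": {\n \"query\": \"What is 2 + 2?\"\n },\n \"name\": \"memory_search\"\n}"}],"api":"openai-completions","model":"hermes3:8b","stopReason":"stop","responseId":"chatcmpl-326"}}',
]

names = [leaked_tool_name(line) for line in LOG]
tally = Counter(name for name in names if name)
print(tally)
```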

### Comparison: same prompt shape on vLLM with `--tool-call-parser hermes`

We migrated to `vllm/vllm-openai:latest` with `--enable-auto-tool-choice --tool-call-parser hermes` against the same `Hermes-3-Llama-3.1-8B` weights, and the agent's realistic shape now produces:

- For tool-warranted prompts: structured `tool_calls`, empty `content`, `finish_reason: tool_calls`.
- For non-tool prompts ("What is 2+2?"): `tool_calls: []`, `content: "4"`, `finish_reason: stop`.

This indicates the model is emitting the right tokens; the issue is the per-model parser inside Ollama not extracting them.
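For reference, the extraction step a Hermes-style parser performs can be sketched in a few lines. This is an illustration only and is not vLLM's or Ollama's actual implementation; it assumes the Hermes function-calling convention of wrapping each call in `<tool_call>...</tool_call>` tags, and `parse_hermes_output` is a hypothetical name.

```python
import json
import re

# Assumed Hermes convention: each call is a JSON object inside <tool_call> tags.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def parse_hermes_output(text: str):
    """Split raw model text into (content, tool_calls).

    Tagged blocks are always removed from content; blocks whose body is not
    valid JSON with a `name` key are dropped rather than returned.
    """
    tool_calls = []
    for body in TOOL_CALL_RE.findall(text):
        try:
            call = json.loads(body)
        except json.JSONDecodeError:
            continue
        if isinstance(call, dict) and "name" in call:
            tool_calls.append({
                "type": "function",
                "function": {
                    "name": call["name"],
                    "arguments": json.dumps(call.get("arguments", {})),
                },
            })
    content = TOOL_CALL_RE.sub("", text).strip()
    return content, tool_calls


tagged = '<tool_call>\n{"arguments": {"message": "hello?"}, "name": "sessions_send"}\n</tool_call>'
content, calls = parse_hermes_output(tagged)
print(repr(content), calls)
```

Note that a parser keyed only on the tags returns no tool calls for bare JSON without markers, which is the shape seen in the logs above; the text simply passes through as content.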

Checklist

  • I confirmed this bug is reproducible
  • I searched existing issues and this is not a duplicate
