Description
When Hermes-3-Llama-3.1-8B is queried via Ollama's OpenAI-compatible chat-completions endpoint with a realistic agent request shape (multi-tool surface + sender-metadata preamble + long system prompt typical of an agent framework), the model's tool-calls are rendered as a stringified-JSON blob inside content as type: text, not as a structured tool_calls field.
The same Ollama instance on the same host returns textbook-correct structured tool_calls for a single-tool minimal probe (one tool definition, terse user prompt). The degradation therefore depends on the request shape rather than on the model alone: Hermes-3 appears to emit the right tokens (see the vLLM comparison under Debug Output), but Ollama's per-model template router fails to extract the tool call into the structured field once prompt complexity crosses a threshold.
The class of issue is the same one documented for Qwen-family in #13968, #12174, #14493, #15529 and for gemma4 in #15539. Hermes-3 has been widely recommended as the "tool-calling-friendly" Ollama model; this report documents that it is also affected once realistic agent load is applied.
Expected response shape (single-tool minimal probe — works)
{
"choices": [{
"message": {
"role": "assistant",
"content": "",
"tool_calls": [{
"type": "function",
"function": {"name": "get_current_weather", "arguments": "{...}"}
}]
},
"finish_reason": "tool_calls"
}]
}
Actual response shape (under realistic agent shape — fails)
{
"choices": [{
"message": {
"role": "assistant",
"content": "{\n \"arguments\": {\n \"message\": \"hello?\"\n },\n \"name\": \"sessions_send\"\n}",
"tool_calls": null
},
"finish_reason": "stop"
}]
}
The OpenAI-compat client (here, an OpenClaw agent gateway) reads content as plain text, so no tool is dispatched and no follow-up turn is produced. The user sees raw JSON in their TUI.
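For triage, the two shapes above can be told apart mechanically. The following is a minimal sketch, not part of Ollama or OpenClaw; the function names `parse_stringified_tool_call` and `classify_response` are hypothetical, and it assumes the leaked call is always a bare JSON object with `name` and `arguments` keys, as in the example above.

```python
import json


def parse_stringified_tool_call(text):
    """Return (name, arguments) if text is a JSON object with a string
    'name' and an object 'arguments'; otherwise return None."""
    try:
        obj = json.loads(text)
    except (TypeError, ValueError):
        return None
    if (
        isinstance(obj, dict)
        and isinstance(obj.get("name"), str)
        and isinstance(obj.get("arguments"), dict)
    ):
        return obj["name"], obj["arguments"]
    return None


def classify_response(response):
    """Classify a chat-completions response body (already decoded to a dict).

    'structured'  -> tool_calls is populated (expected shape)
    'stringified' -> tool_calls is empty but content holds a JSON tool call
    'plain'       -> ordinary text answer
    """
    message = response["choices"][0]["message"]
    if message.get("tool_calls"):
        return "structured"
    if parse_stringified_tool_call(message.get("content") or "") is not None:
        return "stringified"
    return "plain"
```

Running `classify_response` over the decoded output of both probes in the reproduction steps should distinguish the working case from the failing one without reading the JSON by eye.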
Reproduction Steps
1. Start Ollama with Hermes-3-Llama-3.1-8B:
docker run -d --name ollama --gpus all -p 11434:11434 \
-v $HOME/ollama-data:/root/.ollama \
-e OLLAMA_KEEP_ALIVE=0 \
ollama/ollama:latest
docker exec ollama ollama pull hermes3:8b
2. Single-tool minimal probe, confirming that the structured tool_calls path works:
curl -sS http://127.0.0.1:11434/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "hermes3:8b",
"messages": [{"role": "user", "content": "What is the weather in San Francisco?"}],
"tools": [{"type":"function","function":{"name":"get_current_weather","parameters":{"type":"object","properties":{"location":{"type":"string"}}}}}]
}'
Result: tool_calls populated, content empty, finish_reason: tool_calls. ✓
3. Realistic agent shape probe: same model, multiple tools, sender-metadata preamble, long system prompt (representative of any agent framework like OpenClaw / LangChain / etc.):
# See attached realistic-agent-probe.json — sender preamble, ~6 tools
# (sessions_send, memory_search, web_fetch, exec, etc.), multi-paragraph
# system prompt of the kind the agent gateway sends in production.
curl -sS http://127.0.0.1:11434/v1/chat/completions \
-H 'Content-Type: application/json' \
-d @realistic-agent-probe.json
Result: tool_calls: null, content contains stringified JSON like {"arguments":{...},"name":"sessions_send"}, finish_reason: stop.
The transition between (2) and (3) is gradual — adding tool definitions and lengthening the system prompt progressively destabilizes the parser. Above some threshold it fails 100% of the time.
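To locate that threshold, it may help to generate probes of increasing size rather than hand-editing one file. The sketch below is an illustration only: the tool names are the ones mentioned in this report, but their parameter schemas, the filler paragraph, the `extra_tool_N` names, and the helper names (`make_tool`, `build_probe`, `sweep`) are all placeholders, and it does not model the sender-metadata preamble.

```python
import json

# Tool names mentioned in this report; the schemas below are placeholders.
TOOL_NAMES = ["sessions_send", "memory_search", "web_fetch", "exec"]

# Placeholder text standing in for one paragraph of a real system prompt.
FILLER_PARAGRAPH = (
    "You are an agent running inside a gateway. Use the provided tools "
    "when they are relevant and answer directly otherwise."
)


def make_tool(name):
    """Build one OpenAI-style tool definition with a placeholder schema."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": f"Placeholder description for {name}.",
            "parameters": {
                "type": "object",
                "properties": {"input": {"type": "string"}},
            },
        },
    }


def build_probe(num_tools, system_paragraphs, user_text="hello?"):
    """Build a chat-completions request body with the given complexity."""
    extra = [f"extra_tool_{i}" for i in range(num_tools)]
    names = (TOOL_NAMES + extra)[:num_tools]
    messages = []
    if system_paragraphs > 0:
        system_prompt = "\n\n".join([FILLER_PARAGRAPH] * system_paragraphs)
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_text})
    return {
        "model": "hermes3:8b",
        "messages": messages,
        "tools": [make_tool(n) for n in names],
    }


def sweep(max_tools, max_paragraphs):
    """Yield (num_tools, system_paragraphs, payload) over the whole grid."""
    for n in range(1, max_tools + 1):
        for p in range(0, max_paragraphs + 1):
            yield n, p, build_probe(n, p)
```

Each payload can be written out with `json.dump` and sent with the same `curl ... -d @file` invocation as in step 3, recording for each grid point whether `tool_calls` came back populated.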
Environment
- Ollama: ollama/ollama:latest (Docker, ARM64), GPU runtime via --gpus all
- Ollama envs: OLLAMA_FLASH_ATTENTION=1, OLLAMA_KV_CACHE_TYPE=q8_0, OLLAMA_NUM_PARALLEL=2, OLLAMA_KEEP_ALIVE=0
- Model: hermes3:8b (Hermes-3-Llama-3.1-8B, Q4_K_M GGUF as packaged by Ollama)
- Hardware: NVIDIA GB10 (DGX Spark), 96 GB unified memory; 28 layers offloaded to GPU
- OS: Ubuntu 24.04 aarch64
- Driver / CUDA: NVIDIA 580.142 / CUDA 13.0
- Caller: OpenClaw 2026.4.9 agent gateway via /v1/chat/completions (OpenAI-compat), behind an auth-proxy (token check only; payload unmodified)
Debug Output
Logs
Captured 2026-04-29 19:17–19:21 UTC from a live agent session (OpenClaw rtfm sandbox, four user prompts). All four assistant messages have `content` as stringified JSON tool-call, no `tool_calls` field, `stopReason: "stop"`:
{"type":"message","id":"d0ab96a5","timestamp":"2026-04-29T19:17:14.840Z","message":{"role":"assistant","content":[{"type":"text","text":"{\n \"arguments\": {\n \"message\": \"hello?\"\n },\n \"name\": \"sessions_send\"\n}"}],"api":"openai-completions","model":"hermes3:8b","stopReason":"stop","responseId":"chatcmpl-976"}}
{"type":"message","id":"f2e95dc7","timestamp":"2026-04-29T19:20:22.687Z","message":{"role":"assistant","content":[{"type":"text","text":"{\n \"arguments\": {\n \"query\": \"What is 2 + 2?\"\n },\n \"name\": \"memory_search\"\n}"}],"api":"openai-completions","model":"hermes3:8b","stopReason":"stop","responseId":"chatcmpl-326"}}
{"type":"message","id":"7dd727ce","timestamp":"2026-04-29T19:20:55.344Z","message":{"role":"assistant","content":[{"type":"text","text":"{\n \"arguments\": {\n \"query\": \"Tell me one fact about robotics in one sentence.\"\n },\n \"name\": \"memory_search\"\n}"}],"api":"openai-completions","model":"hermes3:8b","stopReason":"stop","responseId":"chatcmpl-678"}}
{"type":"message","id":"af9d6e2a","timestamp":"2026-04-29T19:21:32.981Z","message":{"role":"assistant","content":[{"type":"text","text":"{\n \"arguments\": {\n \"message\": \"ok\"\n },\n \"name\": \"sessions_send\"\n}"}],"api":"openai-completions","model":"hermes3:8b","stopReason":"stop","responseId":"chatcmpl-134"}}
Note `content[0].type: "text"` and `text` containing a literal stringified JSON object with `arguments` and `name` keys — the exact shape the Hermes-3 chat template emits, but unwrapped from its tool-call markers and not extracted by Ollama into `tool_calls`.
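The same check can be run over a whole session log to count how often each tool call leaked into text. This is a rough sketch that assumes one JSON record per line in the format of the four lines above; `leaked_tool_names` is a hypothetical helper, not an OpenClaw API.

```python
import json
from collections import Counter


def leaked_tool_names(log_lines):
    """Count assistant text parts whose text is a JSON object with
    'name' and 'arguments' keys, grouped by tool name."""
    counts = Counter()
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except ValueError:
            continue  # skip lines that are not JSON records
        if not isinstance(record, dict):
            continue
        message = record.get("message")
        if not isinstance(message, dict) or message.get("role") != "assistant":
            continue
        content = message.get("content")
        if not isinstance(content, list):
            continue
        for part in content:
            if not isinstance(part, dict) or part.get("type") != "text":
                continue
            try:
                obj = json.loads(part.get("text", ""))
            except (TypeError, ValueError):
                continue  # ordinary prose, not a leaked tool call
            if (
                isinstance(obj, dict)
                and isinstance(obj.get("name"), str)
                and "arguments" in obj
            ):
                counts[obj["name"]] += 1
    return counts
```

Applied to the four records shown here, this should report two leaked calls each for `sessions_send` and `memory_search`.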
### Comparison: same prompt shape on vLLM with `--tool-call-parser hermes`
We migrated to `vllm/vllm-openai:latest` with `--enable-auto-tool-choice --tool-call-parser hermes` against the same `Hermes-3-Llama-3.1-8B` weights, and the agent's realistic shape now produces:
- For tool-warranted prompts: structured `tool_calls`, empty `content`, `finish_reason: tool_calls`.
- For non-tool prompts ("What is 2+2?"): `tool_calls: []`, `content: "4"`, `finish_reason: stop`.
This indicates the model is emitting the right tokens; the issue is the per-model parser inside Ollama not extracting them.