Bug Description
When Hermes Agent talks to a custom OpenAI-compatible inference server (e.g. local llama.cpp / llama-server, vLLM, etc.) configured via custom_providers, two important request fields silently drop on the floor in the chat_completions transport:
temperature — never set, so the backend server's default takes over. With llama.cpp this means temperature=1.0, which produces factual drift on grounded tasks. Example from a real run: the model invented "3.3B active params" (correct: 3B) and described a 20GB GGUF as "sub-4GB" (conflating active-param footprint with file size).
parallel_tool_calls — never set, so even when the underlying model and chat template support parallel tool calls, every multi-tool query serializes into N sequential assistant turns. A 4-file read fires as 4 separate turns instead of 1, ~4× the latency.
Both fields ARE handled correctly for cloud providers — the Codex/Responses transport hardcodes parallel_tool_calls: True, and specific model families (GPT-5, Codex, etc.) get fixed_temperature via _fixed_temperature_for_model(). The gap is specifically in the chat_completions transport's handling of the custom-provider path.
Steps to Reproduce
-
Start a local OpenAI-compatible server. Example with llama-server:
llama-server --model qwen3.6-35b-a3b.gguf --jinja --host 127.0.0.1 --port 8080
(default temperature is 1.0; verify via curl http://127.0.0.1:8080/props)
-
Configure Hermes Agent with a custom provider in ~/.hermes/config.yaml:
custom_providers:
- name: Local
base_url: http://localhost:8080/v1
model:
-
Switch to that provider and run a multi-step prompt that should trigger parallel tool use, e.g.:
"Read these 4 files (a.py, b.py, c.py, d.py), then summarize each one."
-
Observe two issues:
- Output contains factual drift on grounded claims (caused by temp=1.0)
- The four read_file calls fire one at a time (preparing → read → preparing → read → ...) rather than as one batched assistant message with 4 tool_calls
Expected Behavior
- A sensible temperature default (e.g. 0.2-0.3) is set for agent workloads on custom providers, OR
temperature is exposed as a per-provider config field so users can set it without restarting the inference server.
parallel_tool_calls: true is sent by default on outbound /v1/chat/completions requests when tools are present, matching the Codex transport (transports/codex.py:98) and the OpenAI API spec.
Actual Behavior
- No
temperature field is sent on outbound requests. llama.cpp falls back to its default of 1.0 (verified via /props: default_generation_settings.temperature = 1.0). Model produces factually drifted output.
- No
parallel_tool_calls field is sent. Even when the chat template advertises support (chat_template_caps.supports_parallel_tool_calls: true per /props), the model serializes tool calls into separate turns. A direct probe against the same llama-server shows that adding parallel_tool_calls:true to the request body produces 3 tool_calls in one assistant turn; without it, only 1. Behavior is fully reproducible.
Affected Component
Agent Core (conversation loop, context compression, memory), Configuration (config.yaml, .env, hermes setup)
Messaging Platform (if gateway-related)
N/A (CLI only)
Debug Report
Report https://paste.rs/ttMcl
agent.log https://paste.rs/B1DTq
gateway.log https://paste.rs/GLzg9
Operating System
Ubuntu 24.04.4 LTS
Python Version
Python 3.11.15
Hermes Version
0.12.0 (2026.4.30)
Additional Logs / Traceback (optional)
Root Cause Analysis (optional)
Issue 1 — temperature:
agent/transports/chat_completions.py:245-251 only adds temperature to the request when fixed_temperature is provided:
# Temperature
fixed_temp = params.get("fixed_temperature")
omit_temp = params.get("omit_temperature", False)
if omit_temp:
api_kwargs.pop("temperature", None)
elif fixed_temp is not None:
api_kwargs["temperature"] = fixed_temp
fixed_temperature is populated by _fixed_temperature_for_model(), which only returns a value for specific cloud model families (GPT-5, Codex, etc.). Custom OpenAI-compatible providers never hit that branch, so temperature is never added to api_kwargs and the server's default is used.
Issue 2 — parallel_tool_calls:
A repo-wide grep for parallel_tool_calls returns only two hits:
agent/transports/codex.py:98 — hardcoded parallel_tool_calls: True for the Codex/Responses transport
agent/codex_responses_adapter.py:677,708-709 — passthrough handling for the Codex Responses adapter
The chat_completions transport (the one custom providers use) never sets the field. With OpenAI's spec defaulting to true but llama.cpp (and many other backends) requiring it explicitly, the omission produces serial-only tool calling on local stacks.
Proposed Fix (optional)
Two minimal changes in agent/transports/chat_completions.py:
-
Default temperature for custom providers. Around line 251, when fixed_temperature is None AND is_custom_provider is True, set a sane default:
elif params.get("is_custom_provider"):
api_kwargs["temperature"] = params.get("temperature", 0.2)
-
Default parallel_tool_calls when tools are present. Around line 265, after api_kwargs["tools"] = tools, add:
api_kwargs.setdefault("parallel_tool_calls", True)
A more configurable variant: add optional temperature and parallel_tool_calls fields to the custom_providers schema:
custom_providers:
- name: Local
base_url: http://localhost:8080/v1
model:
temperature: 0.2 # new
parallel_tool_calls: true # new
Happy to send a PR if the maintainers prefer one approach over the other.
Are you willing to submit a PR for this?
Bug Description
When Hermes Agent talks to a custom OpenAI-compatible inference server (e.g. local llama.cpp / llama-server, vLLM, etc.) configured via
custom_providers, two important request fields silently drop on the floor in the chat_completions transport:temperature— never set, so the backend server's default takes over. With llama.cpp this means temperature=1.0, which produces factual drift on grounded tasks. Example from a real run: the model invented "3.3B active params" (correct: 3B) and described a 20GB GGUF as "sub-4GB" (conflating active-param footprint with file size).parallel_tool_calls— never set, so even when the underlying model and chat template support parallel tool calls, every multi-tool query serializes into N sequential assistant turns. A 4-file read fires as 4 separate turns instead of 1, ~4× the latency.Both fields ARE handled correctly for cloud providers — the Codex/Responses transport hardcodes
parallel_tool_calls: True, and specific model families (GPT-5, Codex, etc.) getfixed_temperaturevia_fixed_temperature_for_model(). The gap is specifically in the chat_completions transport's handling of the custom-provider path.Steps to Reproduce
Start a local OpenAI-compatible server. Example with llama-server:
llama-server --model qwen3.6-35b-a3b.gguf --jinja --host 127.0.0.1 --port 8080
(default temperature is 1.0; verify via
curl http://127.0.0.1:8080/props)Configure Hermes Agent with a custom provider in ~/.hermes/config.yaml:
custom_providers:
- name: Local
base_url: http://localhost:8080/v1
model:
Switch to that provider and run a multi-step prompt that should trigger parallel tool use, e.g.:
"Read these 4 files (a.py, b.py, c.py, d.py), then summarize each one."
Observe two issues:
Expected Behavior
temperatureis exposed as a per-provider config field so users can set it without restarting the inference server.parallel_tool_calls: trueis sent by default on outbound /v1/chat/completions requests when tools are present, matching the Codex transport (transports/codex.py:98) and the OpenAI API spec.Actual Behavior
temperaturefield is sent on outbound requests. llama.cpp falls back to its default of 1.0 (verified via /props: default_generation_settings.temperature = 1.0). Model produces factually drifted output.parallel_tool_callsfield is sent. Even when the chat template advertises support (chat_template_caps.supports_parallel_tool_calls: true per /props), the model serializes tool calls into separate turns. A direct probe against the same llama-server shows that adding parallel_tool_calls:true to the request body produces 3 tool_calls in one assistant turn; without it, only 1. Behavior is fully reproducible.Affected Component
Agent Core (conversation loop, context compression, memory), Configuration (config.yaml, .env, hermes setup)
Messaging Platform (if gateway-related)
N/A (CLI only)
Debug Report
Operating System
Ubuntu 24.04.4 LTS
Python Version
Python 3.11.15
Hermes Version
0.12.0 (2026.4.30)
Additional Logs / Traceback (optional)
Root Cause Analysis (optional)
Issue 1 — temperature:
agent/transports/chat_completions.py:245-251only addstemperatureto the request whenfixed_temperatureis provided:fixed_temperatureis populated by_fixed_temperature_for_model(), which only returns a value for specific cloud model families (GPT-5, Codex, etc.). Custom OpenAI-compatible providers never hit that branch, sotemperatureis never added to api_kwargs and the server's default is used.Issue 2 — parallel_tool_calls:
A repo-wide grep for
parallel_tool_callsreturns only two hits:agent/transports/codex.py:98— hardcodedparallel_tool_calls: Truefor the Codex/Responses transportagent/codex_responses_adapter.py:677,708-709— passthrough handling for the Codex Responses adapterThe chat_completions transport (the one custom providers use) never sets the field. With OpenAI's spec defaulting to
truebut llama.cpp (and many other backends) requiring it explicitly, the omission produces serial-only tool calling on local stacks.Proposed Fix (optional)
Two minimal changes in agent/transports/chat_completions.py:
Default temperature for custom providers. Around line 251, when
fixed_temperatureis None ANDis_custom_provideris True, set a sane default:elif params.get("is_custom_provider"):
api_kwargs["temperature"] = params.get("temperature", 0.2)
Default parallel_tool_calls when tools are present. Around line 265, after
api_kwargs["tools"] = tools, add:api_kwargs.setdefault("parallel_tool_calls", True)
A more configurable variant: add optional
temperatureandparallel_tool_callsfields to the custom_providers schema:custom_providers:
- name: Local
base_url: http://localhost:8080/v1
model:
temperature: 0.2 # new
parallel_tool_calls: true # new
Happy to send a PR if the maintainers prefer one approach over the other.
Are you willing to submit a PR for this?