Skip to content

[Bug]: Custom OpenAI-compatible providers: temperature and parallel_tool_calls request fields not propagated #18470

@abovespec

Description

@abovespec

Bug Description

When Hermes Agent talks to a custom OpenAI-compatible inference server (e.g. local llama.cpp / llama-server, vLLM, etc.) configured via custom_providers, two important request fields silently drop on the floor in the chat_completions transport:

  1. temperature — never set, so the backend server's default takes over. With llama.cpp this means temperature=1.0, which produces factual drift on grounded tasks. Example from a real run: the model invented "3.3B active params" (correct: 3B) and described a 20GB GGUF as "sub-4GB" (conflating active-param footprint with file size).
  2. parallel_tool_calls — never set, so even when the underlying model and chat template support parallel tool calls, every multi-tool query serializes into N sequential assistant turns. A 4-file read fires as 4 separate turns instead of 1, ~4× the latency.

Both fields ARE handled correctly for cloud providers — the Codex/Responses transport hardcodes parallel_tool_calls: True, and specific model families (GPT-5, Codex, etc.) get fixed_temperature via _fixed_temperature_for_model(). The gap is specifically in the chat_completions transport's handling of the custom-provider path.

Steps to Reproduce

  1. Start a local OpenAI-compatible server. Example with llama-server:
    llama-server --model qwen3.6-35b-a3b.gguf --jinja --host 127.0.0.1 --port 8080
    (default temperature is 1.0; verify via curl http://127.0.0.1:8080/props)

  2. Configure Hermes Agent with a custom provider in ~/.hermes/config.yaml:
    custom_providers:
    - name: Local
    base_url: http://localhost:8080/v1
    model:

  3. Switch to that provider and run a multi-step prompt that should trigger parallel tool use, e.g.:
    "Read these 4 files (a.py, b.py, c.py, d.py), then summarize each one."

  4. Observe two issues:

    • Output contains factual drift on grounded claims (caused by temp=1.0)
    • The four read_file calls fire one at a time (preparing → read → preparing → read → ...) rather than as one batched assistant message with 4 tool_calls

Expected Behavior

  • A sensible temperature default (e.g. 0.2-0.3) is set for agent workloads on custom providers, OR temperature is exposed as a per-provider config field so users can set it without restarting the inference server.
  • parallel_tool_calls: true is sent by default on outbound /v1/chat/completions requests when tools are present, matching the Codex transport (transports/codex.py:98) and the OpenAI API spec.

Actual Behavior

  • No temperature field is sent on outbound requests. llama.cpp falls back to its default of 1.0 (verified via /props: default_generation_settings.temperature = 1.0). Model produces factually drifted output.
  • No parallel_tool_calls field is sent. Even when the chat template advertises support (chat_template_caps.supports_parallel_tool_calls: true per /props), the model serializes tool calls into separate turns. A direct probe against the same llama-server shows that adding parallel_tool_calls:true to the request body produces 3 tool_calls in one assistant turn; without it, only 1. Behavior is fully reproducible.

Affected Component

Agent Core (conversation loop, context compression, memory), Configuration (config.yaml, .env, hermes setup)

Messaging Platform (if gateway-related)

N/A (CLI only)

Debug Report

Report       https://paste.rs/ttMcl
agent.log    https://paste.rs/B1DTq
gateway.log  https://paste.rs/GLzg9

Operating System

Ubuntu 24.04.4 LTS

Python Version

Python 3.11.15

Hermes Version

0.12.0 (2026.4.30)

Additional Logs / Traceback (optional)

Root Cause Analysis (optional)

Issue 1 — temperature:

agent/transports/chat_completions.py:245-251 only adds temperature to the request when fixed_temperature is provided:

# Temperature
fixed_temp = params.get("fixed_temperature")
omit_temp = params.get("omit_temperature", False)
if omit_temp:
    api_kwargs.pop("temperature", None)
elif fixed_temp is not None:
    api_kwargs["temperature"] = fixed_temp

fixed_temperature is populated by _fixed_temperature_for_model(), which only returns a value for specific cloud model families (GPT-5, Codex, etc.). Custom OpenAI-compatible providers never hit that branch, so temperature is never added to api_kwargs and the server's default is used.

Issue 2 — parallel_tool_calls:

A repo-wide grep for parallel_tool_calls returns only two hits:

  • agent/transports/codex.py:98 — hardcoded parallel_tool_calls: True for the Codex/Responses transport
  • agent/codex_responses_adapter.py:677,708-709 — passthrough handling for the Codex Responses adapter

The chat_completions transport (the one custom providers use) never sets the field. With OpenAI's spec defaulting to true but llama.cpp (and many other backends) requiring it explicitly, the omission produces serial-only tool calling on local stacks.

Proposed Fix (optional)

Two minimal changes in agent/transports/chat_completions.py:

  1. Default temperature for custom providers. Around line 251, when fixed_temperature is None AND is_custom_provider is True, set a sane default:
    elif params.get("is_custom_provider"):
    api_kwargs["temperature"] = params.get("temperature", 0.2)

  2. Default parallel_tool_calls when tools are present. Around line 265, after api_kwargs["tools"] = tools, add:
    api_kwargs.setdefault("parallel_tool_calls", True)

A more configurable variant: add optional temperature and parallel_tool_calls fields to the custom_providers schema:
custom_providers:
- name: Local
base_url: http://localhost:8080/v1
model:
temperature: 0.2 # new
parallel_tool_calls: true # new

Happy to send a PR if the maintainers prefer one approach over the other.

Are you willing to submit a PR for this?

  • I'd like to fix this myself and submit a PR

Metadata

Metadata

Assignees

No one assigned

    Labels

    P2Medium — degraded but workaround existsarea/configConfig system, migrations, profilescomp/agentCore agent loop, run_agent.py, prompt buildertype/bugSomething isn't working

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions