[Bug]: agent.reasoning_effort: none silently ignored on Ollama — main agent stuck in medium mode, bg-review fork can spiral (up to 65k tokens / 28 min) #25758

@pmocquard

Bug Description

When using Hermes with a custom provider pointing at a local Ollama instance and a thinking-capable model (Qwen3.x, DeepSeek-style), agent.reasoning_effort: none is silently ignored, both for the main agent and for the background-review fork. The model thinks anyway, sometimes catastrophically: we observed up to 209,538 chars of reasoning_content and 65,056 output tokens in a single turn, blocking the GPU for 28 minutes.

The root cause is in two distinct places in run_agent.py, and both need to be addressed:

  1. For the main agent on Ollama: Hermes emits extra_body.think=False via the custom provider plugin (the fix for #6152, "[Feature]: Pass think: false to Ollama for non-reasoning models"), but Ollama's /v1/chat/completions endpoint silently ignores think:false. The top-level reasoning_effort=none field, which Ollama does respect, is never emitted.

  2. For the background-review fork: _spawn_background_review() creates a new AIAgent without propagating self.reasoning_config. Even when (1) is fixed, the fork still runs with the default reasoning_effort=medium: without the parent's reasoning_config, extra_body.think=False is never produced for it, so the fix in (1) has nothing to mirror.

Defect (2) is structurally similar to #15543 (fork missing api_key/base_url/api_mode until that was fixed) — the fork loses inherited state.

Environment

  • Hermes Agent: nousresearch/hermes-agent:latest (Docker)
  • Provider: custom (Ollama 0.19, local, MLX backend)
  • Model: qwen3.6:35b-a3b-coding-nvfp4 (MoE, MLX-accelerated, thinking-capable)
  • Config: agent.reasoning_effort: none
  • Ollama upstream: ignores think:false on /v1/chat/completions (ollama#14820)

Steps to Reproduce

  1. Configure Hermes with a custom provider pointing at a local Ollama instance running a thinking-capable model.
  2. Set agent.reasoning_effort: none in config.yaml.
  3. Send any prompt to the main agent — observe in the session JSON (/opt/data/sessions/session_*.json) that assistant messages still carry non-empty reasoning_content.
  4. Run a non-trivial session (5+ tool calls, file edits) so that _should_review_memory or _should_review_skills triggers a bg-review at the end.
  5. Quit the session.
  6. On some workloads — especially when the skill content contains internal contradictions the model tries to resolve — the bg-review enters a reasoning loop:
WAIT. OH WAIT. WAIT.: WHEN... WAS: WAIT. WAIT. OH WAIT...
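Step 3 can be checked mechanically rather than by eye. The sketch below scans session files for assistant messages with non-empty reasoning_content. It assumes each session file is a JSON object with a top-level "messages" list; that layout and the helper names are assumptions, not taken from the Hermes source, so adjust them to the real schema.

```python
import json
from pathlib import Path


def reasoning_sizes(session: dict) -> list[tuple[int, int]]:
    """Return (message_index, reasoning_chars) for assistant messages that carry reasoning.

    Assumes the session is a dict with a "messages" list (hypothetical layout).
    """
    hits = []
    for index, msg in enumerate(session.get("messages", [])):
        if not isinstance(msg, dict) or msg.get("role") != "assistant":
            continue
        reasoning = msg.get("reasoning_content") or ""
        if reasoning:
            hits.append((index, len(reasoning)))
    return hits


def scan_sessions(directory: str = "/opt/data/sessions") -> dict[str, list[tuple[int, int]]]:
    """Map each session file name to its reasoning-bearing assistant messages."""
    report = {}
    for path in sorted(Path(directory).glob("session_*.json")):
        with path.open(encoding="utf-8") as fh:
            session = json.load(fh)
        hits = reasoning_sizes(session)
        if hits:
            report[path.name] = hits
    return report
```

With reasoning_effort: none honored, scan_sessions() should return an empty dict for sessions recorded after the change.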

Expected Behavior

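With agent.reasoning_effort: none, neither the main agent nor the bg-review fork produces reasoning_content, and no reasoning loop occurs.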

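Actual Behavior

Both the main agent and the bg-review fork keep reasoning, and the bg-review fork can spiral into a reasoning loop.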

Evidence

From a real bg-review session log:

API call #1: in=37315 out=485   total=37800   latency=78s     ← normal
API call #2: in=46310 out=578   total=46888   latency=36s     ← normal
API call #3: in=46869 out=65056 total=111925  latency=1724s   ← spiral (28 min)
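A small parser makes the spiral easy to spot in longer logs. This is a sketch that assumes log lines follow exactly the format quoted above; the 8,000-token threshold is an arbitrary illustrative value, not a Hermes setting.

```python
import re
from typing import NamedTuple

# Matches lines such as: "API call #3: in=46869 out=65056 total=111925  latency=1724s"
LINE_RE = re.compile(
    r"API call #(\d+): in=(\d+)\s+out=(\d+)\s+total=(\d+)\s+latency=(\d+)s"
)


class ApiCall(NamedTuple):
    index: int
    tokens_in: int
    tokens_out: int
    total: int
    latency_s: int


def parse_calls(log_text: str) -> list[ApiCall]:
    """Extract every API call record found in the log text."""
    return [ApiCall(*map(int, m.groups())) for m in LINE_RE.finditer(log_text)]


def find_spirals(calls: list[ApiCall], max_out_tokens: int = 8000) -> list[ApiCall]:
    """Flag calls whose output token count exceeds the (assumed) threshold."""
    return [c for c in calls if c.tokens_out > max_out_tokens]


LOG = """\
API call #1: in=37315 out=485   total=37800   latency=78s
API call #2: in=46310 out=578   total=46888   latency=36s
API call #3: in=46869 out=65056 total=111925  latency=1724s
"""

calls = parse_calls(LOG)
spirals = find_spirals(calls)
```

On the three calls above, only call #3 is flagged; its 1,724 s latency is roughly 28.7 minutes.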

The offending assistant message in the session JSON:

{
  "role": "assistant",
  "content": "(empty)",
  "reasoning_content": "...209538 chars of WAIT / OH WAIT cascade...",
  "finish_reason": "stop",
  "_empty_recovery_synthetic": true
}

Hermes' _empty_recovery_synthetic mechanism correctly catches the empty response and nudges the model, but 28 minutes of GPU decode are already gone.

Affected Component

CLI (interactive chat)

Messaging Platform (if gateway-related)

No response

Debug Report

## Root Cause

### Defect 1 — main agent on Ollama

The `custom` provider plugin sets `extra_body["think"] = False` when `reasoning_config.effort == "none"` or `enabled is False`. But Ollama silently ignores `extra_body.think` on `/v1/chat/completions` (it only honors it on `/api/chat`). The top-level `reasoning_effort` field that Ollama **does** support is never emitted from `_build_api_kwargs()` for `custom` providers.
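To make the difference concrete, the sketch below builds the JSON body a client would send in each case. It assumes the client merges `extra_body` keys into the request body, as the OpenAI Python SDK does; `build_request_body` is a hypothetical helper for illustration, not a Hermes function.

```python
from typing import Any, Optional


def build_request_body(
    model: str,
    messages: list[dict[str, str]],
    extra_body: Optional[dict[str, Any]] = None,
    **top_level: Any,
) -> dict[str, Any]:
    """Assemble a chat-completions body; extra_body keys are merged in last."""
    body: dict[str, Any] = {"model": model, "messages": messages, **top_level}
    body.update(extra_body or {})
    return body


messages = [{"role": "user", "content": "Say hi."}]
model = "qwen3.6:35b-a3b-coding-nvfp4"

# What is sent today: only `think`, which /v1/chat/completions ignores.
current = build_request_body(model, messages, extra_body={"think": False})

# What the report proposes: the same body plus a top-level reasoning_effort.
proposed = build_request_body(
    model, messages, extra_body={"think": False}, reasoning_effort="none"
)
```

The only difference between `current` and `proposed` is the `reasoning_effort` key, which is the field this report says Ollama honors on `/v1/chat/completions`.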

### Defect 2 — bg-review fork

In `run_agent.py`, `_spawn_background_review()` (around line 4117):


```python
review_agent = AIAgent(
    model=self.model,
    max_iterations=16,
    quiet_mode=True,
    platform=self.platform,
    provider=self.provider,
    api_mode=_parent_runtime.get("api_mode") or None,
    base_url=_parent_runtime.get("base_url") or None,
    api_key=_parent_runtime.get("api_key") or None,
    credential_pool=getattr(self, "_credential_pool", None),
    parent_session_id=self.session_id,
    enabled_toolsets=["memory", "skills"],
)
review_agent._memory_write_origin = "background_review"
review_agent._memory_write_context = "background_review"
review_agent._memory_store = self._memory_store
review_agent._memory_enabled = self._memory_enabled
review_agent._user_profile_enabled = self._user_profile_enabled
review_agent._memory_nudge_interval = 0
review_agent._skill_nudge_interval = 0
```


`self.reasoning_config` is never propagated. Since `AIAgent.__init__` defaults `reasoning_config=None` (= medium), the fork runs in medium mode regardless of the parent's effective config — even after Defect 1 is fixed for the parent.

## Proposed Fix

### Fix 1 — emit top-level `reasoning_effort` for Ollama

In `_build_api_kwargs()` in `run_agent.py`, after the existing `extra_body["think"]=False` block, mirror the value at the top level when the target is Ollama-style:


```python
if isinstance(api_kwargs, dict):
    _eb = api_kwargs.get("extra_body")
    if isinstance(_eb, dict) and _eb.get("think") is False:
        api_kwargs["reasoning_effort"] = "none"
```


Rationale: Ollama's `/v1/chat/completions` accepts `reasoning_effort` at the top level (it is a standard OpenAI-style field for some upstream models) and uses it to suppress thinking. Other Ollama-compatible servers that don't recognize the field will simply ignore it. The underlying upstream bug, `think:false` being ignored on `/v1/chat/completions`, is tracked separately at [ollama#14820](https://github.com/ollama/ollama/issues/14820).
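If the mirroring should apply only to Ollama-style targets, the check can be wrapped in a helper with an explicit gate. The following is a standalone sketch: the function name and the `ollama_style` flag are hypothetical, and how Hermes would actually detect an Ollama-style endpoint is left open.

```python
from typing import Any


def mirror_think_to_reasoning_effort(
    api_kwargs: dict[str, Any], ollama_style: bool = True
) -> dict[str, Any]:
    """Add a top-level reasoning_effort="none" when extra_body carries think=False.

    `ollama_style` is a hypothetical flag the caller would derive from the provider
    or base URL. An explicit reasoning_effort already present is left untouched.
    """
    if not ollama_style or not isinstance(api_kwargs, dict):
        return api_kwargs
    extra_body = api_kwargs.get("extra_body")
    if isinstance(extra_body, dict) and extra_body.get("think") is False:
        api_kwargs.setdefault("reasoning_effort", "none")
    return api_kwargs
```

Unlike the snippet above, this version uses `setdefault`, so a `reasoning_effort` set earlier in the pipeline would win; whether overwrite or preserve is the right behavior is a design choice for the maintainers.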

### Fix 2 — propagate `reasoning_config` to the fork

In `_spawn_background_review()`, after the existing `review_agent.X = self.X` assignments:


```python
review_agent.reasoning_config = self.reasoning_config or {"enabled": False, "effort": "none"}
```


The `or {...}` fallback handles the case where the parent itself has `reasoning_config=None` (the default): the fork then runs with reasoning disabled rather than falling back to medium. For non-Ollama setups, this is effectively a no-op for models that don't expose reasoning toggling, since the provider plugins ignore the field.
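The inheritance behavior can be pinned down with a toy model before touching `run_agent.py`. `ToyAgent` below is a hypothetical stand-in, not the real `AIAgent`; it only encodes the two facts stated in this report: `reasoning_config=None` means medium, and the fork starts from the default unless the config is copied over.

```python
from typing import Optional

DISABLED = {"enabled": False, "effort": "none"}


class ToyAgent:
    """Minimal stand-in for AIAgent, modelling only reasoning_config (hypothetical)."""

    def __init__(self, reasoning_config: Optional[dict] = None):
        self.reasoning_config = reasoning_config

    def effective_effort(self) -> str:
        cfg = self.reasoning_config
        if cfg is None:
            return "medium"  # default described in this report
        if cfg.get("enabled") is False:
            return "none"
        return cfg.get("effort", "medium")

    def spawn_review(self, propagate: bool) -> "ToyAgent":
        """Create the review fork, optionally applying the proposed propagation."""
        child = ToyAgent()
        if propagate:
            # Same expression as the proposed fix, copied to avoid shared state.
            child.reasoning_config = dict(self.reasoning_config or DISABLED)
        return child


parent = ToyAgent({"enabled": False, "effort": "none"})
unfixed_fork = parent.spawn_review(propagate=False)
fixed_fork = parent.spawn_review(propagate=True)
```

Here `unfixed_fork.effective_effort()` is "medium" while `fixed_fork.effective_effort()` is "none", which mirrors the before/after behavior described for the bg-review fork.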

## Validation

After applying both fixes in our deployment, on the same workload that previously spiraled:

| Metric | Before fixes | After fixes |
|---|---|---|
| `reasoning_content` per main-agent message | non-empty | **0 chars** |
| `reasoning_content` per bg-review message | up to 209,538 chars | **0 chars** |
| Max `out=` tokens on bg-review turn | 65,056 | **3,894** |
| Max latency on bg-review turn | 1,724s (28 min) | **88s** |
| `_empty_recovery_synthetic` triggered | yes | no |
| Bg-review still produces useful `tool_calls` | yes (eclipsed by reasoning) | **yes (clean)** |

The bg-review continues to do real work — `skill_manage` patches, `execute_code` blocks of 10–14k chars — so the self-improvement loop stays fully functional. The main agent also stops paying the medium-reasoning tax.
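For scale, the reduction factors implied by the table can be computed directly from its figures. This is plain arithmetic on the numbers reported above, not a new measurement.

```python
before = {"max_out_tokens": 65_056, "max_latency_s": 1_724}
after = {"max_out_tokens": 3_894, "max_latency_s": 88}


def reduction_factor(before_value: float, after_value: float) -> float:
    """How many times smaller the 'after' value is than the 'before' value."""
    return before_value / after_value


token_factor = reduction_factor(before["max_out_tokens"], after["max_out_tokens"])
latency_factor = reduction_factor(before["max_latency_s"], after["max_latency_s"])

print(f"output tokens: {token_factor:.1f}x fewer, latency: {latency_factor:.1f}x lower")
```

On these figures the worst-case bg-review turn emits about 16.7 times fewer output tokens and finishes about 19.6 times faster.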

## References

- #6152 — initial Ollama `think:false` support (resolved, but the emitted field is silently dropped by `/v1/chat/completions`)
- #15543 — earlier instance of bg-review fork losing inherited state (auth credentials)
- [ollama/ollama#14820](https://github.com/ollama/ollama/issues/14820) — upstream Ollama bug: `think:false` ignored on `/v1/chat/completions`

Operating System

macOS 26.4.1

Python Version

3.13.5

Hermes Version

v0.13.0

Additional Logs / Traceback (optional)

Happy to submit a PR with both fixes if the maintainers want.

Are you willing to submit a PR for this?

  • I'd like to fix this myself and submit a PR

Labels

  • P2: Medium (degraded but workaround exists)
  • comp/agent: Core agent loop, run_agent.py, prompt builder
  • provider/ollama: Ollama / local models
  • type/bug: Something isn't working
