[Bug]: agent.reasoning_effort: none silently ignored on Ollama — main agent stuck in medium mode, bg-review fork can spiral (up to 65k tokens / 28 min) #25758

### Bug Description

When using Hermes with a `custom` provider pointing at a local Ollama instance and a thinking-capable model (Qwen3.x, DeepSeek-style), `agent.reasoning_effort: none` is silently ignored, both for the main agent and for the background-review fork. The model thinks anyway, sometimes catastrophically: we observed up to 209,538 chars of `reasoning_content` and 65,056 output tokens in a single turn, blocking the GPU for 28 minutes.

The root cause is in **two distinct places** in `run_agent.py`, and both need to be addressed:

1. **For the main agent on Ollama**: Hermes emits `extra_body.think=False` via the `custom` provider plugin (the fix for #6152), but Ollama's `/v1/chat/completions` endpoint silently ignores `think:false`. The top-level `reasoning_effort=none` field, which Ollama does respect, is never emitted.

2. **For the background-review fork**: `_spawn_background_review()` creates a new `AIAgent` without propagating `self.reasoning_config`. Even when (1) is fixed, the fork still runs with the default `reasoning_effort=medium`, because `extra_body.think=False` is also never produced for it.

Defect (2) is structurally similar to #15543 (fork missing `api_key`/`base_url`/`api_mode` until that was fixed) — the fork loses inherited state.

## Environment

- Hermes Agent: `nousresearch/hermes-agent:latest` (Docker)
- Provider: `custom` (Ollama 0.19, local, MLX backend)
- Model: `qwen3.6:35b-a3b-coding-nvfp4` (MoE, MLX-accelerated, thinking-capable)
- Config: `agent.reasoning_effort: none`
- Ollama upstream: ignores `think:false` on `/v1/chat/completions` ([ollama#14820](https://github.com/ollama/ollama/issues/14820))


### Steps to Reproduce

1. Configure Hermes with a `custom` provider pointing at a local Ollama instance running a thinking-capable model.
2. Set `agent.reasoning_effort: none` in `config.yaml`.
3. Send any prompt to the main agent — observe in the session JSON (`/opt/data/sessions/session_*.json`) that assistant messages still carry non-empty `reasoning_content`.
4. Run a non-trivial session (5+ tool calls, file edits) so that `_should_review_memory` or `_should_review_skills` triggers a bg-review at the end.
5. Quit the session.
6. On some workloads — especially when the skill content contains internal contradictions the model tries to resolve — the bg-review enters a reasoning loop:

```
WAIT. OH WAIT. WAIT.: WHEN... WAS: WAIT. WAIT. OH WAIT...
```
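Step 3 can be checked without reading the session files by hand. The following is a minimal sketch, assuming each session file is either a JSON object with a top-level `messages` list or a bare list of messages (the exact layout is not confirmed here); `reasoning_lengths` and `scan_sessions` are hypothetical helper names, not part of Hermes.

```python
import json
from pathlib import Path


def reasoning_lengths(session):
    """Return len(reasoning_content) for every assistant message.

    ASSUMPTION: `session` is a dict with a "messages" list, or a bare list.
    """
    messages = session.get("messages", []) if isinstance(session, dict) else session
    return [
        len(msg.get("reasoning_content") or "")
        for msg in messages
        if isinstance(msg, dict) and msg.get("role") == "assistant"
    ]


def scan_sessions(directory):
    """Map each session_*.json file name to its per-message reasoning lengths."""
    results = {}
    for path in sorted(Path(directory).glob("session_*.json")):
        with path.open(encoding="utf-8") as fh:
            results[path.name] = reasoning_lengths(json.load(fh))
    return results
```

Calling `scan_sessions("/opt/data/sessions")` returns one list per file; if `agent.reasoning_effort: none` were honored, every entry would be 0.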


### Expected Behavior

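With `agent.reasoning_effort: none`, neither the main agent nor the bg-review fork should produce `reasoning_content`, so no reasoning loop can occur.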

### Actual Behavior

## Evidence

From a real bg-review session log:

```
API call #1: in=37315 out=485   total=37800   latency=78s     ← normal
API call #2: in=46310 out=578   total=46888   latency=36s     ← normal
API call #3: in=46869 out=65056 total=111925  latency=1724s   ← spiral (28 min)
```
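Lines in this format are easy to screen automatically. The sketch below parses the three log lines above and flags calls that look like a spiral; the function names and both thresholds (`max_out_tokens`, `max_latency_s`) are illustrative assumptions, not values used by Hermes.

```python
import re

# Matches lines such as: "API call #3: in=46869 out=65056 total=111925  latency=1724s"
CALL_RE = re.compile(
    r"API call #(?P<n>\d+): in=(?P<inp>\d+)\s+out=(?P<out>\d+)"
    r"\s+total=(?P<total>\d+)\s+latency=(?P<latency_s>\d+)s"
)

SAMPLE_LOG = """\
API call #1: in=37315 out=485   total=37800   latency=78s     ← normal
API call #2: in=46310 out=578   total=46888   latency=36s     ← normal
API call #3: in=46869 out=65056 total=111925  latency=1724s   ← spiral (28 min)
"""


def parse_calls(log_text):
    """Return one dict of integer fields per 'API call' line found."""
    return [
        {key: int(value) for key, value in match.groupdict().items()}
        for match in CALL_RE.finditer(log_text)
    ]


def flag_spirals(calls, max_out_tokens=8000, max_latency_s=300):
    """Return the call numbers whose output size or latency exceeds a threshold.

    ASSUMPTION: both thresholds are arbitrary illustration values.
    """
    return [
        call["n"]
        for call in calls
        if call["out"] > max_out_tokens or call["latency_s"] > max_latency_s
    ]


calls = parse_calls(SAMPLE_LOG)
spirals = flag_spirals(calls)
```

On the sample above only call #3 is flagged. Its input size is almost the same as call #2, so the extra latency comes from the 65,056 output tokens.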

The offending assistant message in the session JSON:

```json
{
  "role": "assistant",
  "content": "(empty)",
  "reasoning_content": "...209538 chars of WAIT / OH WAIT cascade...",
  "finish_reason": "stop",
  "_empty_recovery_synthetic": true
}
```

Hermes' `_empty_recovery_synthetic` mechanism correctly catches the empty response and nudges the model, but by then 28 minutes of GPU decode are already gone.


### Affected Component

CLI (interactive chat)

### Messaging Platform (if gateway-related)

_No response_

### Debug Report

```shell
```

### Operating System

macOS 26.4.1

### Python Version

3.13.5

### Hermes Version

v0.13.0

### Additional Logs / Traceback (optional)

```shell

```

### Root Cause Analysis (optional)

## Root Cause

### Defect 1 — main agent on Ollama

The `custom` provider plugin sets `extra_body["think"] = False` when `reasoning_config.effort == "none"` or `enabled is False`. But Ollama silently ignores `extra_body.think` on `/v1/chat/completions` (it only honors it on `/api/chat`). The top-level `reasoning_effort` field that Ollama **does** support is never emitted from `_build_api_kwargs()` for `custom` providers.
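Defect 1 can be checked against the endpoint directly, without going through Hermes. The sketch below is a hypothetical probe, not Hermes code: `build_payload`, `reasoning_chars` and `probe` are made-up names, and it assumes that fields passed via `extra_body` end up as top-level keys in the JSON request body (which is how the OpenAI Python SDK treats `extra_body`). The name of the response field carrying the thinking text is also an assumption, so the probe looks at both `reasoning` and `reasoning_content`.

```python
import json
import urllib.request


def build_payload(model, prompt, variant):
    """Build a chat-completions body for one of three request shapes."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    if variant == "think_false":
        # ASSUMPTION: what extra_body={"think": False} looks like on the wire.
        payload["think"] = False
    elif variant == "reasoning_effort_none":
        payload["reasoning_effort"] = "none"
    elif variant != "baseline":
        raise ValueError(f"unknown variant: {variant}")
    return payload


def reasoning_chars(response):
    """Count thinking characters in the first choice of a parsed response."""
    message = response["choices"][0]["message"]
    # ASSUMPTION: the field name may differ between server versions.
    text = message.get("reasoning") or message.get("reasoning_content") or ""
    return len(text)


def probe(base_url, model, prompt, variant, timeout=600):
    """POST one request and return the number of thinking characters."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_payload(model, prompt, variant)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return reasoning_chars(json.load(resp))
```

Running `probe("http://localhost:11434", "qwen3.6:35b-a3b-coding-nvfp4", "What is 2+2?", variant)` for each of `"baseline"`, `"think_false"` and `"reasoning_effort_none"` would show whether only the last variant returns 0, which is the behavior described above.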

### Defect 2 — bg-review fork

In `run_agent.py`, `_spawn_background_review()` (around line 4117):

```python
review_agent = AIAgent(
    model=self.model,
    max_iterations=16,
    quiet_mode=True,
    platform=self.platform,
    provider=self.provider,
    api_mode=_parent_runtime.get("api_mode") or None,
    base_url=_parent_runtime.get("base_url") or None,
    api_key=_parent_runtime.get("api_key") or None,
    credential_pool=getattr(self, "_credential_pool", None),
    parent_session_id=self.session_id,
    enabled_toolsets=["memory", "skills"],
)
review_agent._memory_write_origin = "background_review"
review_agent._memory_write_context = "background_review"
review_agent._memory_store = self._memory_store
review_agent._memory_enabled = self._memory_enabled
review_agent._user_profile_enabled = self._user_profile_enabled
review_agent._memory_nudge_interval = 0
review_agent._skill_nudge_interval = 0
```

`self.reasoning_config` is never propagated. Since `AIAgent.__init__` defaults `reasoning_config=None` (= medium), the fork runs in medium mode regardless of the parent's effective config — even after Defect 1 is fixed for the parent.



### Proposed Fix (optional)

### Fix 1 — emit top-level `reasoning_effort` for Ollama

In `_build_api_kwargs()` in `run_agent.py`, after the existing `extra_body["think"]=False` block, mirror the value at the top level. The snippet keys off `extra_body["think"] is False` rather than detecting Ollama explicitly:

```python
if isinstance(api_kwargs, dict):
    _eb = api_kwargs.get("extra_body")
    if isinstance(_eb, dict) and _eb.get("think") is False:
        api_kwargs["reasoning_effort"] = "none"
```

Rationale: Ollama's `/v1/chat/completions` accepts `reasoning_effort` at the top level (it's a standard OpenAI-style field for some upstream models) and uses it to suppress thinking. Other OpenAI-compatible servers that don't recognize the field are expected to ignore it. The upstream behavior of `think:false` being dropped is tracked at [ollama#14820](https://github.com/ollama/ollama/issues/14820).
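To make the intended behavior of Fix 1 testable in isolation, the same logic can be wrapped in a small function. This is a sketch under the assumption that `api_kwargs` is a plain dict at that point; `mirror_think_flag` is a hypothetical name and does not exist in `run_agent.py`.

```python
def mirror_think_flag(api_kwargs):
    """Copy an explicit think=False from extra_body to top-level reasoning_effort.

    Mutates and returns api_kwargs. Only an exact `False` triggers the mirror;
    a missing key, `True`, `None`, or `0` leaves the kwargs untouched.
    """
    if isinstance(api_kwargs, dict):
        extra_body = api_kwargs.get("extra_body")
        if isinstance(extra_body, dict) and extra_body.get("think") is False:
            api_kwargs["reasoning_effort"] = "none"
    return api_kwargs


disabled = mirror_think_flag({"model": "m", "extra_body": {"think": False}})
enabled = mirror_think_flag({"model": "m", "extra_body": {"think": True}})
absent = mirror_think_flag({"model": "m"})
```

Here `disabled` gains `reasoning_effort="none"` while `enabled` and `absent` are unchanged. Note that the assignment overwrites any `reasoning_effort` already present in `api_kwargs`; whether that is desirable depends on how `_build_api_kwargs()` populates the field elsewhere.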

### Fix 2 — propagate `reasoning_config` to the fork

In `_spawn_background_review()`, after the existing `review_agent.X = self.X` assignments:

```python
review_agent.reasoning_config = self.reasoning_config or {"enabled": False, "effort": "none"}
```

The `or {...}` fallback handles the case where the parent itself has `reasoning_config=None` (default): the fork then runs with reasoning disabled instead of inheriting the medium default. For non-Ollama setups, this is effectively a no-op for models that don't expose reasoning toggling, because the provider plugins ignore the field.
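The semantics of the `or` fallback can be illustrated with a stand-in class. `StubAgent` below is purely hypothetical and models only the `reasoning_config` attribute; it is not the real `AIAgent`.

```python
class StubAgent:
    """Hypothetical stand-in for AIAgent; only reasoning_config is modelled."""

    def __init__(self, reasoning_config=None):
        self.reasoning_config = reasoning_config

    def spawn_review(self):
        """Create a fork and apply the propagation line from Fix 2."""
        fork = StubAgent()
        fork.reasoning_config = self.reasoning_config or {
            "enabled": False,
            "effort": "none",
        }
        return fork


parent_none = StubAgent({"enabled": False, "effort": "none"})
parent_high = StubAgent({"enabled": True, "effort": "high"})
parent_default = StubAgent()  # reasoning_config=None

fork_none = parent_none.spawn_review()
fork_high = parent_high.spawn_review()
fork_default = parent_default.spawn_review()
```

Two consequences follow from the expression itself. A parent with an explicit config hands the fork the same dict object (not a copy), so a mutation on either side is visible to both. A parent with `None` or an empty dict gets a fork with reasoning disabled, because both values are falsy in Python.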

## Validation

After applying both fixes in our deployment, on the same workload that previously spiraled:

| Metric | Before fixes | After fixes |
|---|---|---|
| `reasoning_content` per main-agent message | non-empty | **0 chars** |
| `reasoning_content` per bg-review message | up to 209,538 chars | **0 chars** |
| Max `out=` tokens on bg-review turn | 65,056 | **3,894** |
| Max latency on bg-review turn | 1,724s (28 min) | **88s** |
| `_empty_recovery_synthetic` triggered | yes | no |
| Bg-review still produces useful `tool_calls` | yes (eclipsed by reasoning) | **yes (clean)** |

The bg-review continues to do real work — `skill_manage` patches, `execute_code` blocks of 10–14k chars — so the self-improvement loop stays fully functional. The main agent also stops paying the medium-reasoning tax.
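For a sense of scale, the reduction factors implied by the table can be computed directly. This is only arithmetic on the figures reported above; the variable names are illustrative.

```python
# Worst-case bg-review figures taken from the validation table.
before = {"out_tokens": 65_056, "latency_s": 1_724}
after = {"out_tokens": 3_894, "latency_s": 88}

# How many times smaller each metric became after the fixes.
token_factor = before["out_tokens"] / after["out_tokens"]
latency_factor = before["latency_s"] / after["latency_s"]

# Worst-case latency before the fixes, expressed in minutes.
minutes_before = before["latency_s"] / 60

print(f"output tokens: {token_factor:.1f}x fewer")
print(f"latency: {latency_factor:.1f}x lower ({minutes_before:.1f} min before)")
```

That is roughly a 17-fold drop in output tokens and a 20-fold drop in worst-case latency on this one workload.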

## References

- #6152 — initial Ollama `think:false` support (resolved, but the emitted field is silently dropped by `/v1/chat/completions`)
- #15543 — earlier instance of bg-review fork losing inherited state (auth credentials)
- [ollama/ollama#14820](https://github.com/ollama/ollama/issues/14820) — upstream Ollama bug: `think:false` ignored on `/v1/chat/completions`

Happy to submit a PR with both fixes if the maintainers want.


### Are you willing to submit a PR for this?

- [ ] I'd like to fix this myself and submit a PR
