Problem or Use Case
Summary
When using Hermes Agent with a local Ollama backend and a thinking-capable model (e.g. qwen3:8b), Hermes never passes think: false in the chat completions request. This causes the model to run its full reasoning chain on every request, which on CPU inference can take several minutes before producing the first output token — making the agent loop effectively unusable.
Environment
- Hermes Agent (latest)
- Ollama 0.20.0
- Model:
qwen3:8b Q4_K_M via custom OpenAI-compatible endpoint (http://host.docker.internal:11434/v1)
- CPU inference (no GPU)
Steps to reproduce
- Configure Hermes with a local Ollama endpoint
- Use any Qwen3 model (or other thinking-capable model)
- Send any message — observe multi-minute delay before first token
- Check Ollama logs — thinking tokens are being generated silently before any response content
Root cause
Ollama 0.6+ supports a think parameter in the /api/chat and /v1/chat/completions endpoints. When think: false is passed, the model skips the reasoning phase entirely and responds immediately. Hermes never passes this parameter, so thinking-capable models always run in thinking mode regardless of the user's reasoning_effort config.
The affected code is _build_api_kwargs() in run_agent.py around line 5394, where the chat completions payload is assembled.
Workaround
Manually patching run_agent.py to add "think": False to the api_kwargs dict fixes the issue and brings response time from several minutes down to ~1 second on the same hardware.
Proposed Solution
Add an opt-in config option (e.g. provider.think: false) or auto-detect when the endpoint is an Ollama instance and pass think: false when reasoning_effort is not explicitly enabled. At minimum, exposing this as an environment variable (HERMES_OLLAMA_THINK=false) would be a low-risk fix.
Happy to submit a PR if the maintainers can advise on the preferred approach.
Alternatives Considered
No response
Feature Type
Configuration option
Scope
None
Contribution
Problem or Use Case
Summary
When using Hermes Agent with a local Ollama backend and a thinking-capable model (e.g.
qwen3:8b), Hermes never passesthink: falsein the chat completions request. This causes the model to run its full reasoning chain on every request, which on CPU inference can take several minutes before producing the first output token — making the agent loop effectively unusable.Environment
qwen3:8bQ4_K_M via custom OpenAI-compatible endpoint (http://host.docker.internal:11434/v1)Steps to reproduce
Root cause
Ollama 0.6+ supports a
thinkparameter in the/api/chatand/v1/chat/completionsendpoints. Whenthink: falseis passed, the model skips the reasoning phase entirely and responds immediately. Hermes never passes this parameter, so thinking-capable models always run in thinking mode regardless of the user'sreasoning_effortconfig.The affected code is
_build_api_kwargs()inrun_agent.pyaround line 5394, where the chat completions payload is assembled.Workaround
Manually patching
run_agent.pyto add"think": Falseto theapi_kwargsdict fixes the issue and brings response time from several minutes down to ~1 second on the same hardware.Proposed Solution
Add an opt-in config option (e.g.
provider.think: false) or auto-detect when the endpoint is an Ollama instance and passthink: falsewhenreasoning_effortis not explicitly enabled. At minimum, exposing this as an environment variable (HERMES_OLLAMA_THINK=false) would be a low-risk fix.Happy to submit a PR if the maintainers can advise on the preferred approach.
Alternatives Considered
No response
Feature Type
Configuration option
Scope
None
Contribution