Summary
The max_tokens option documented in cli-config.yaml.example is never read from user config or propagated to AIAgent.__init__(). As a result, on providers without a hardcoded default in ChatCompletionsTransport.build_kwargs() (e.g. Ollama Cloud, zai, custom OpenAI-compatible endpoints), the parameter is omitted from the API call entirely and the server falls back to its own short default. Responses get truncated with finish_reason="length" (or, on Ollama, the suspicious-stop heuristic flags them as truncated), and the user has no way to fix it via config.
Symptom
Running GLM-4.6 / GLM-5.x via Ollama Cloud (and similar custom endpoints):
⚠️ Treating suspicious Ollama/GLM stop response as truncated
⚠️ Response truncated (finish_reason='length') - model hit max output tokens
Setting max_tokens: 16384 under model: in ~/.hermes/config.yaml has no effect.
Steps to Reproduce
- Configure a custom-endpoint or Ollama Cloud provider with a GLM model:
    model:
      default: glm-5.1
      provider: ollama-cloud
      base_url: https://ollama.com/v1
      max_tokens: 16384
- Send a prompt that elicits a long response (e.g. "summarize this 5k-line file in detail").
- Observe truncation warnings and short responses regardless of the max_tokens value in config.
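To confirm the truncation independently of the agent's warnings, the finish_reason field of the raw response can be checked. The following is a minimal sketch, assuming the endpoint returns the standard OpenAI-compatible chat completion shape (choices[].finish_reason). is_truncated is a hypothetical helper and not part of hermes-agent, and the sample payloads are hand-written rather than captured from a server.

```python
from collections.abc import Mapping
from typing import Any


def is_truncated(response: Mapping[str, Any]) -> bool:
    """Return True if any choice stopped because it hit the output-token cap."""
    choices = response.get("choices") or []
    return any(choice.get("finish_reason") == "length" for choice in choices)


# Hand-written sample payloads in the OpenAI-compatible shape (not real server output).
truncated = {"choices": [{"finish_reason": "length", "message": {"content": "..."}}]}
complete = {"choices": [{"finish_reason": "stop", "message": {"content": "done"}}]}

print(is_truncated(truncated), is_truncated(complete))
```

Note that this only covers the finish_reason="length" case; the Ollama suspicious-stop heuristic mentioned above is internal to the agent and is not modeled here.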
Root cause
Four independent gaps on the same code path:
1. _resolve_runtime_agent_kwargs() does not include max_tokens — gateway/run.py:553-589 returns only:
    return {
        "api_key": runtime.get("api_key"),
        "base_url": runtime.get("base_url"),
        "provider": runtime.get("provider"),
        "api_mode": runtime.get("api_mode"),
        "command": runtime.get("command"),
        "args": list(runtime.get("args") or []),
        "credential_pool": runtime.get("credential_pool"),
    }
_resolve_turn_agent_config() (gateway/run.py:1593-1601) copies the same set into runtime without adding max_tokens.
2. No call site passes max_tokens= to AIAgent(...) — verified across all 6 instantiation sites:
cli.py:3641 (interactive CLI)
cli.py:6811 (background agent)
gateway/run.py:6167 (hygiene agent)
gateway/run.py:8927 (gateway turn)
gateway/run.py:9411 (tmp agent)
gateway/run.py:13335 (gateway turn alt path)
AIAgent.__init__ declares max_tokens: int = None (run_agent.py:937), so self.max_tokens ends up None for every gateway- or CLI-created agent.
3. ChatCompletionsTransport.build_kwargs() drops the param entirely for non-privileged providers — agent/transports/chat_completions.py:267-289:
    if ephemeral is not None and max_tokens_fn:
        api_kwargs.update(max_tokens_fn(ephemeral))
    elif max_tokens is not None and max_tokens_fn:
        api_kwargs.update(max_tokens_fn(max_tokens))
    elif is_nvidia_nim and max_tokens_fn:
        api_kwargs.update(max_tokens_fn(16384))
    elif is_qwen and max_tokens_fn:
        api_kwargs.update(max_tokens_fn(65536))
    elif is_kimi and max_tokens_fn:
        api_kwargs.update(max_tokens_fn(32000))
    elif anthropic_max_out is not None:
        api_kwargs["max_tokens"] = anthropic_max_out
    # else: max_tokens is omitted from the API request entirely
For zai / Ollama Cloud / custom OpenAI-compatible endpoints there is no is_zai / is_glm / is_ollama branch, so the request goes out with no max_tokens and the server uses its own (small) default.
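The fall-through can be illustrated with a simplified stand-in for the branch chain. This is a sketch under stated assumptions, not the actual build_kwargs() code: the provider flags are collapsed into a string lookup, max_tokens_fn is modeled as a function returning {"max_tokens": n}, the Anthropic branch is left out, and the provider key names are illustrative.

```python
from typing import Dict, Optional


def max_tokens_fn(n: int) -> Dict[str, int]:
    """Hypothetical stand-in: map an output cap to request kwargs."""
    return {"max_tokens": n}


# Values taken from the branches quoted above; the key names are illustrative.
HARDCODED_DEFAULTS: Dict[str, int] = {"nvidia-nim": 16384, "qwen": 65536, "kimi": 32000}


def build_max_tokens_kwargs(
    provider: str,
    max_tokens: Optional[int] = None,
    ephemeral: Optional[int] = None,
) -> Dict[str, int]:
    """Mirror the precedence: ephemeral, then explicit value, then provider default."""
    api_kwargs: Dict[str, int] = {}
    if ephemeral is not None:
        api_kwargs.update(max_tokens_fn(ephemeral))
    elif max_tokens is not None:
        api_kwargs.update(max_tokens_fn(max_tokens))
    elif provider in HARDCODED_DEFAULTS:
        api_kwargs.update(max_tokens_fn(HARDCODED_DEFAULTS[provider]))
    # else: nothing is added, so the server applies its own default
    return api_kwargs


print(build_max_tokens_kwargs("qwen"))
print(build_max_tokens_kwargs("ollama-cloud"))
print(build_max_tokens_kwargs("ollama-cloud", max_tokens=16384))
```

The last call shows that the chain would honor an explicit value if one ever reached it, which is why gaps 1 and 2 matter: the configured value never gets that far.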
4. No env-var fallback either — there is no HERMES_MAX_TOKENS or HERMES_MAX_OUTPUT reader anywhere in the codebase.
Documentation gap
cli-config.yaml.example:58-65 documents the option as user-facing:
# max_tokens: OUTPUT cap — maximum tokens the model may generate per response.
# Leave unset to use the model's native output ceiling (recommended).
# Set only if you want to deliberately limit individual response length.
# max_tokens: 8192
…but the value is never read, so this comment is misleading.
Expected behavior
Setting max_tokens (or model.max_tokens) in config.yaml should result in the corresponding parameter being passed to the underlying API call, regardless of provider.
Suggested fix
Three options, ordered by surface area:
1. Single-source default: change run_agent.py:937 to a sensible default (e.g. max_tokens: int = 16384) so all paths benefit. Cheapest fix, but doesn't honor user config.
2. Read config in _resolve_runtime_agent_kwargs(): load model.max_tokens from config.yaml, add it to the returned dict, then thread it through _resolve_turn_agent_config() so all gateway AIAgent(...) calls pick it up. Mirror the same change in cli.py (read user config, pass max_tokens= to the two AIAgent(...) sites). This is what the docs already promise.
3. Per-provider default in build_kwargs(): add an is_zai / is_ollama branch alongside is_qwen / is_kimi / is_nvidia_nim to set a sane fallback (e.g. 32k or 65k) when max_tokens is None. Complementary to (2).
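Options 2 and 3 could be combined roughly as follows. This is only a sketch of the intended precedence (user config first, then a per-provider fallback, otherwise omit), assuming the config has already been parsed into a dict. read_max_tokens, effective_max_tokens and PROVIDER_FALLBACKS are hypothetical names, and the fallback values are placeholders rather than tested limits.

```python
from collections.abc import Mapping
from typing import Any, Dict, Optional

# Hypothetical fallback table for option 3; values are placeholders.
PROVIDER_FALLBACKS: Dict[str, int] = {"ollama-cloud": 32768, "zai": 32768}


def read_max_tokens(config: Mapping[str, Any]) -> Optional[int]:
    """Option 2: read model.max_tokens, falling back to top-level max_tokens."""
    model = config.get("model")
    raw = model.get("max_tokens") if isinstance(model, Mapping) else None
    if raw is None:
        raw = config.get("max_tokens")
    if raw is None:
        return None
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(raw, bool) or not isinstance(raw, int) or raw <= 0:
        raise ValueError(f"max_tokens must be a positive integer, got {raw!r}")
    return raw


def effective_max_tokens(config: Mapping[str, Any], provider: str) -> Optional[int]:
    """User config wins; otherwise use the provider fallback, if any."""
    configured = read_max_tokens(config)
    if configured is not None:
        return configured
    return PROVIDER_FALLBACKS.get(provider)


config = {"model": {"default": "glm-5.1", "provider": "ollama-cloud", "max_tokens": 16384}}
print(effective_max_tokens(config, "ollama-cloud"))
print(effective_max_tokens({}, "ollama-cloud"))
print(effective_max_tokens({}, "some-other-provider"))
```

Returning None for unknown providers keeps the current behavior (parameter omitted) wherever neither the user nor the fallback table has an opinion, so the change would only affect requests that are truncated today.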
Happy to submit a PR if maintainers confirm the preferred approach.
Environment
- hermes-agent: main (commit at time of report)
- Provider: Ollama Cloud / custom endpoint with GLM models
- OS: Ubuntu 25.04
- Python: 3.11.15
Related
Symptom-adjacent issues:
finish_reason=length family.