
Bug: max_tokens from config.yaml is silently ignored — never propagated to AIAgent, causing output truncation on Ollama Cloud / zai / custom endpoints #20741

@nemaloveroyatno


Summary

The max_tokens option documented in cli-config.yaml.example is never read from user config or propagated to AIAgent.__init__(). As a result, on providers without a hardcoded default in ChatCompletionsTransport.build_kwargs() (e.g. Ollama Cloud, zai, custom OpenAI-compatible endpoints), the parameter is omitted from the API call entirely and the server falls back to its own short default. Responses get truncated with finish_reason="length" (or, on Ollama, the suspicious-stop heuristic flags them as truncated), and the user has no way to fix it via config.

Symptom

Running GLM-4.6 / GLM-5.x via Ollama Cloud (and similar custom endpoints):

⚠️  Treating suspicious Ollama/GLM stop response as truncated
⚠️  Response truncated (finish_reason='length') - model hit max output tokens

max_tokens: 16384 set under model: in ~/.hermes/config.yaml has no effect.

Steps to Reproduce

  1. Configure a custom-endpoint or Ollama Cloud provider with a GLM model:
   model:
     default: glm-5.1
     provider: ollama-cloud
     base_url: https://ollama.com/v1
     max_tokens: 16384
  2. Send a prompt that elicits a long response (e.g. "summarize this 5k-line file in detail").
  3. Observe truncation warnings and short responses regardless of the max_tokens value in config.

Root cause

Four independent gaps on the same code path:

1. _resolve_runtime_agent_kwargs() does not include max_tokens. gateway/run.py:553-589 returns only:

return {
    "api_key": runtime.get("api_key"),
    "base_url": runtime.get("base_url"),
    "provider": runtime.get("provider"),
    "api_mode": runtime.get("api_mode"),
    "command": runtime.get("command"),
    "args": list(runtime.get("args") or []),
    "credential_pool": runtime.get("credential_pool"),
}

_resolve_turn_agent_config() (gateway/run.py:1593-1601) copies the same set into runtime without adding max_tokens.

2. No call site passes max_tokens= to AIAgent(...) — verified across all 6 instantiation sites:

  • cli.py:3641 (interactive CLI)
  • cli.py:6811 (background agent)
  • gateway/run.py:6167 (hygiene agent)
  • gateway/run.py:8927 (gateway turn)
  • gateway/run.py:9411 (tmp agent)
  • gateway/run.py:13335 (gateway turn alt path)

AIAgent.__init__ declares max_tokens: int = None (run_agent.py:937), so self.max_tokens ends up None for every gateway- or CLI-created agent.
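
For anyone who wants to re-check the call-site claim above against their own checkout, here is a minimal sketch of a static scan using the standard-library ast module. The helper name find_calls_missing_kwarg and the SAMPLE snippet are illustrative only and are not part of hermes-agent; calls that splat **kwargs are also reported, because a static scan cannot tell what the dict carries.

```python
import ast


def find_calls_missing_kwarg(source: str, func_name: str, kwarg: str) -> list[int]:
    """Return line numbers of `func_name(...)` calls that do not pass `kwarg=`."""
    missing = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        callee = node.func
        # Match both `AIAgent(...)` and `module.AIAgent(...)`.
        if isinstance(callee, ast.Name):
            name = callee.id
        else:
            name = getattr(callee, "attr", None)
        if name != func_name:
            continue
        # Explicit keywords only; a `**kwargs` splat appears with arg=None.
        passed = {kw.arg for kw in node.keywords}
        if kwarg not in passed:
            missing.append(node.lineno)
    return sorted(missing)


# Illustrative input, not copied from the repository.
SAMPLE = '''
agent = AIAgent(model="glm-5.1", base_url=url)
other = run_agent.AIAgent(model="glm-5.1", max_tokens=16384)
third = AIAgent(**runtime_kwargs)
'''

if __name__ == "__main__":
    print(find_calls_missing_kwarg(SAMPLE, "AIAgent", "max_tokens"))
```

To scan real files, read each path (for example cli.py and gateway/run.py) and pass its text as source.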

3. ChatCompletionsTransport.build_kwargs() drops the param entirely for non-privileged providers. agent/transports/chat_completions.py:267-289:

if ephemeral is not None and max_tokens_fn:
    api_kwargs.update(max_tokens_fn(ephemeral))
elif max_tokens is not None and max_tokens_fn:
    api_kwargs.update(max_tokens_fn(max_tokens))
elif is_nvidia_nim and max_tokens_fn:
    api_kwargs.update(max_tokens_fn(16384))
elif is_qwen and max_tokens_fn:
    api_kwargs.update(max_tokens_fn(65536))
elif is_kimi and max_tokens_fn:
    api_kwargs.update(max_tokens_fn(32000))
elif anthropic_max_out is not None:
    api_kwargs["max_tokens"] = anthropic_max_out
# else: max_tokens is omitted from the API request entirely

For zai / Ollama Cloud / custom OpenAI-compatible endpoints there is no is_zai / is_glm / is_ollama branch, so the request goes out with no max_tokens and the server uses its own (small) default.
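
The fallthrough can be illustrated with a simplified stand-in for that branch chain. This is a sketch, not the real method: the function name, the provider strings ("nvidia-nim", "qwen", "kimi") and the lambda standing in for max_tokens_fn are assumptions, and the anthropic_max_out branch is left out for brevity.

```python
def build_max_tokens_kwargs(provider, max_tokens=None, ephemeral=None):
    """Simplified model of the quoted branch chain (not the real build_kwargs)."""
    # Hypothetical stand-in for max_tokens_fn: maps a value to request kwargs.
    max_tokens_fn = lambda n: {"max_tokens": n}
    # Hardcoded defaults taken from the quoted snippet; the keys are assumed names.
    provider_defaults = {"nvidia-nim": 16384, "qwen": 65536, "kimi": 32000}

    api_kwargs = {}
    if ephemeral is not None:
        api_kwargs.update(max_tokens_fn(ephemeral))
    elif max_tokens is not None:
        api_kwargs.update(max_tokens_fn(max_tokens))
    elif provider in provider_defaults:
        api_kwargs.update(max_tokens_fn(provider_defaults[provider]))
    # else: nothing is added, so the request carries no max_tokens at all
    return api_kwargs


if __name__ == "__main__":
    print(build_max_tokens_kwargs("qwen"))          # has a hardcoded default
    print(build_max_tokens_kwargs("ollama-cloud"))  # falls through with no key
```

With max_tokens left as None (gap 2), any provider outside the hardcoded set produces an empty dict here, which matches the behavior described above.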

4. No env-var fallback either — there is no HERMES_MAX_TOKENS or HERMES_MAX_OUTPUT reader anywhere in the codebase.
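
If an env-var fallback were wanted, a reader could look roughly like the sketch below. To be clear, nothing like this exists in the codebase today; the function name max_tokens_from_env is hypothetical, and HERMES_MAX_TOKENS is only the variable name floated above. The choice to treat invalid or non-positive values as unset is also an assumption.

```python
import os


def max_tokens_from_env(env=None, var="HERMES_MAX_TOKENS"):
    """Return a positive int read from the environment, or None if unset or invalid."""
    env = os.environ if env is None else env
    raw = (env.get(var) or "").strip()
    if not raw:
        return None
    try:
        value = int(raw)
    except ValueError:
        # Non-numeric values are treated as "not set" rather than raising.
        return None
    return value if value > 0 else None


if __name__ == "__main__":
    print(max_tokens_from_env({"HERMES_MAX_TOKENS": "16384"}))
```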

Documentation gap

cli-config.yaml.example:58-65 documents the option as user-facing:

# max_tokens: OUTPUT cap — maximum tokens the model may generate per response.
#   Leave unset to use the model's native output ceiling (recommended).
#   Set only if you want to deliberately limit individual response length.
# max_tokens: 8192

…but the value is never read, so this comment is misleading.

Expected behavior

Setting max_tokens (or model.max_tokens) in config.yaml should result in the corresponding parameter being passed to the underlying API call, regardless of provider.

Suggested fix

Three options, ordered by surface area:

  1. Single-source default: change run_agent.py:937 to a sensible default (e.g. max_tokens: int = 16384) so all paths benefit. Cheapest fix, but doesn't honor user config.

  2. Read config in _resolve_runtime_agent_kwargs() — load model.max_tokens from config.yaml, add it to the returned dict, then thread it through _resolve_turn_agent_config() so all gateway AIAgent(...) calls pick it up. Mirror the same change in cli.py (read user config, pass max_tokens= to the two AIAgent(...) sites). This is what the docs already promise.

  3. Per-provider default in build_kwargs() — add an is_zai / is_ollama branch alongside is_qwen / is_kimi / is_nvidia_nim to set a sane fallback (e.g. 32k or 65k) when max_tokens is None. Complementary to (2).
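
A rough sketch of how options 2 and 3 could combine, assuming the parsed config.yaml is available as a plain dict. The helpers resolve_max_tokens and with_max_tokens and the provider_fallbacks mapping are hypothetical names for illustration; the real change would live in _resolve_runtime_agent_kwargs() and build_kwargs(), and the fallback value used in the usage lines is only an example.

```python
def resolve_max_tokens(config, provider, provider_fallbacks=None):
    """Pick max_tokens: explicit user config first, then a per-provider fallback."""
    model_cfg = config.get("model")
    if not isinstance(model_cfg, dict):
        model_cfg = {}
    # Accept model.max_tokens, then a top-level max_tokens.
    raw = model_cfg.get("max_tokens")
    if raw is None:
        raw = config.get("max_tokens")
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(raw, int) and not isinstance(raw, bool) and raw > 0:
        return raw
    return (provider_fallbacks or {}).get(provider)


def with_max_tokens(runtime_kwargs, config, provider_fallbacks=None):
    """Return a copy of the runtime kwargs with max_tokens added when one is known."""
    out = dict(runtime_kwargs)
    value = resolve_max_tokens(config, out.get("provider"), provider_fallbacks)
    if value is not None:
        out["max_tokens"] = value
    return out


if __name__ == "__main__":
    cfg = {"model": {"default": "glm-5.1", "provider": "ollama-cloud", "max_tokens": 16384}}
    print(with_max_tokens({"provider": "ollama-cloud"}, cfg))
    # Illustrative fallback value only; no such default exists in the codebase.
    print(with_max_tokens({"provider": "zai"}, {}, {"zai": 32768}))
```

The resulting dict could then be splatted into each AIAgent(...) call, so the user's value wins and the per-provider fallback only applies when nothing is configured.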

Happy to submit a PR if maintainers confirm the preferred approach.

Environment

  • hermes-agent: main (commit at time of report)
  • Provider: Ollama Cloud / custom endpoint with GLM models
  • OS: Ubuntu 25.04
  • Python: 3.11.15

Related

Symptom-adjacent issues:

Metadata


Labels: P2 (Medium: degraded but workaround exists), area/config (Config system, migrations, profiles), comp/gateway (Gateway runner, session dispatch, delivery), type/bug (Something isn't working)
