Bug: `max_tokens` from config.yaml is silently ignored — never propagated to AIAgent, causing output truncation on Ollama Cloud / zai / custom endpoints

## Summary

The `max_tokens` option documented in `cli-config.yaml.example` is never read from user config or propagated to `AIAgent.__init__()`. As a result, on providers without a hardcoded default in `ChatCompletionsTransport.build_kwargs()` (e.g. Ollama Cloud, zai, custom OpenAI-compatible endpoints), the parameter is omitted from the API call entirely and the server falls back to its own short default. Responses get truncated with `finish_reason="length"` (or, on Ollama, the suspicious-`stop` heuristic flags them as truncated), and the user has no way to fix it via config.

## Symptom

Running GLM-4.6 / GLM-5.x via Ollama Cloud (and similar custom endpoints):

```
⚠️  Treating suspicious Ollama/GLM stop response as truncated
⚠️  Response truncated (finish_reason='length') - model hit max output tokens
```

`max_tokens: 16384` set under `model:` in `~/.hermes/config.yaml` has no effect.

## Steps to Reproduce

1. Configure a custom-endpoint or Ollama Cloud provider with a GLM model:
   ```yaml
   model:
     default: glm-5.1
     provider: ollama-cloud
     base_url: https://ollama.com/v1
     max_tokens: 16384
   ```
2. Send a prompt that elicits a long response (e.g. "summarize this 5k-line file in detail").
3. Observe truncation warnings and short responses regardless of the `max_tokens` value in config.

## Root cause

Four independent gaps on the same code path:

**1. `_resolve_runtime_agent_kwargs()` does not include `max_tokens`** — `gateway/run.py:553-589` returns only:

```python
return {
    "api_key": runtime.get("api_key"),
    "base_url": runtime.get("base_url"),
    "provider": runtime.get("provider"),
    "api_mode": runtime.get("api_mode"),
    "command": runtime.get("command"),
    "args": list(runtime.get("args") or []),
    "credential_pool": runtime.get("credential_pool"),
}
```

`_resolve_turn_agent_config()` (`gateway/run.py:1593-1601`) copies the same set into `runtime` without adding `max_tokens`.

**2. No call site passes `max_tokens=` to `AIAgent(...)`** — verified across all 6 instantiation sites:

- `cli.py:3641` (interactive CLI)
- `cli.py:6811` (background agent)
- `gateway/run.py:6167` (hygiene agent)
- `gateway/run.py:8927` (gateway turn)
- `gateway/run.py:9411` (tmp agent)
- `gateway/run.py:13335` (gateway turn alt path)

`AIAgent.__init__` declares `max_tokens: int = None` (`run_agent.py:937`), so `self.max_tokens` ends up `None` for every gateway- or CLI-created agent.

**3. `ChatCompletionsTransport.build_kwargs()` drops the param entirely for non-privileged providers** — `agent/transports/chat_completions.py:267-289`:

```python
if ephemeral is not None and max_tokens_fn:
    api_kwargs.update(max_tokens_fn(ephemeral))
elif max_tokens is not None and max_tokens_fn:
    api_kwargs.update(max_tokens_fn(max_tokens))
elif is_nvidia_nim and max_tokens_fn:
    api_kwargs.update(max_tokens_fn(16384))
elif is_qwen and max_tokens_fn:
    api_kwargs.update(max_tokens_fn(65536))
elif is_kimi and max_tokens_fn:
    api_kwargs.update(max_tokens_fn(32000))
elif anthropic_max_out is not None:
    api_kwargs["max_tokens"] = anthropic_max_out
# else: max_tokens is omitted from the API request entirely
```

For zai / Ollama Cloud / custom OpenAI-compatible endpoints there is no `is_zai` / `is_glm` / `is_ollama` branch, so the request goes out with no `max_tokens` and the server uses its own (small) default.
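The fall-through can be reproduced with a small standalone model of the branch chain. This is a simplified sketch, not the project's code: `pick_max_tokens`, `as_max_tokens`, and the keyword flags are hypothetical stand-ins for the locals in `build_kwargs()`, the `ephemeral` and Anthropic branches are left out, and `max_tokens_fn` is assumed to return a dict that gets merged into the request.

```python
from typing import Callable, Optional


def pick_max_tokens(
    max_tokens: Optional[int],
    max_tokens_fn: Callable[[int], dict],
    *,
    is_nvidia_nim: bool = False,
    is_qwen: bool = False,
    is_kimi: bool = False,
) -> dict:
    """Hypothetical, simplified model of the branch chain quoted above."""
    api_kwargs: dict = {}
    if max_tokens is not None:
        api_kwargs.update(max_tokens_fn(max_tokens))
    elif is_nvidia_nim:
        api_kwargs.update(max_tokens_fn(16384))
    elif is_qwen:
        api_kwargs.update(max_tokens_fn(65536))
    elif is_kimi:
        api_kwargs.update(max_tokens_fn(32000))
    # No branch matched: the key is simply absent from the request.
    return api_kwargs


def as_max_tokens(n: int) -> dict:
    # Assumed shape of the dict returned by max_tokens_fn.
    return {"max_tokens": n}


# Agent built with max_tokens=None on a provider without a hardcoded default.
glm_request = pick_max_tokens(None, as_max_tokens)
# Same agent state on a provider that does have a hardcoded default.
qwen_request = pick_max_tokens(None, as_max_tokens, is_qwen=True)
```

With `max_tokens=None` and no provider flag set, `glm_request` carries no `max_tokens` key at all, which is the state every gateway- or CLI-created agent is in today.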

**4. No env-var fallback either** — there is no `HERMES_MAX_TOKENS` or `HERMES_MAX_OUTPUT` reader anywhere in the codebase.
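If an env-var fallback were wanted, a reader could look roughly like the sketch below. Both the helper name `max_tokens_from_env` and the choice of `HERMES_MAX_TOKENS` as the variable are assumptions for illustration; nothing like this exists in the codebase today.

```python
import os
from collections.abc import Mapping
from typing import Optional


def max_tokens_from_env(
    env: Mapping[str, str] = os.environ,
    var: str = "HERMES_MAX_TOKENS",  # hypothetical variable name
) -> Optional[int]:
    """Return a positive int from the env var, or None if unset or invalid."""
    raw = env.get(var, "").strip()
    if not raw:
        return None
    try:
        value = int(raw)
    except ValueError:
        # Non-numeric values are ignored rather than raised.
        return None
    return value if value > 0 else None
```

Returning `None` for unset or malformed values keeps the current behavior as the default, so the variable would only ever raise the cap when it is explicitly set.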

## Documentation gap

`cli-config.yaml.example:58-65` documents the option as user-facing:

```yaml
# max_tokens: OUTPUT cap — maximum tokens the model may generate per response.
#   Leave unset to use the model's native output ceiling (recommended).
#   Set only if you want to deliberately limit individual response length.
# max_tokens: 8192
```

…but the value is never read, so this comment is misleading.

## Expected behavior

Setting `max_tokens` (or `model.max_tokens`) in `config.yaml` should result in the corresponding parameter being passed to the underlying API call, regardless of provider.

## Suggested fix

Three options, ordered by surface area:

1. **Single-source default**: change `run_agent.py:937` to a sensible default (e.g. `max_tokens: int = 16384`) so all paths benefit. Cheapest fix, but doesn't honor user config.

2. **Read config in `_resolve_runtime_agent_kwargs()`** — load `model.max_tokens` from `config.yaml`, add it to the returned dict, then thread it through `_resolve_turn_agent_config()` so all gateway `AIAgent(...)` calls pick it up. Mirror the same change in `cli.py` (read user config, pass `max_tokens=` to the two `AIAgent(...)` sites). This is what the docs already promise.

3. **Per-provider default in `build_kwargs()`** — add an `is_zai` / `is_ollama` branch alongside `is_qwen` / `is_kimi` / `is_nvidia_nim` to set a sane fallback (e.g. 32k or 65k) when `max_tokens` is `None`. Complementary to (2).
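Option 2 could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual patch: `max_tokens_from_config` and `with_max_tokens` are hypothetical helpers, the config is assumed to be already parsed into a dict, and the real change would live inside `_resolve_runtime_agent_kwargs()` and the `cli.py` call sites.

```python
from collections.abc import Mapping
from typing import Any, Optional


def max_tokens_from_config(config: Mapping[str, Any]) -> Optional[int]:
    """Read model.max_tokens, then top-level max_tokens, from a parsed config."""
    model = config.get("model")
    raw = model.get("max_tokens") if isinstance(model, Mapping) else None
    if raw is None:
        raw = config.get("max_tokens")
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(raw, bool) or not isinstance(raw, int) or raw <= 0:
        return None
    return raw


def with_max_tokens(runtime_kwargs: dict, config: Mapping[str, Any]) -> dict:
    """Return a copy of the runtime kwargs with max_tokens added."""
    return {**runtime_kwargs, "max_tokens": max_tokens_from_config(config)}


# Config shaped like the reproduction steps above.
config = {
    "model": {
        "default": "glm-5.1",
        "provider": "ollama-cloud",
        "max_tokens": 16384,
    }
}
kwargs = with_max_tokens({"provider": "ollama-cloud"}, config)
```

An unset or invalid value resolves to `None`, which matches the existing `AIAgent.__init__` default, so configs that leave `max_tokens` out would behave exactly as they do now.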

Happy to submit a PR if maintainers confirm the preferred approach.

## Environment

- hermes-agent: `main` (commit at time of report)
- Provider: Ollama Cloud / custom endpoint with GLM models
- OS: Ubuntu 25.04
- Python: 3.11.15

## Related

Symptom-adjacent issues:
- #13042 (GLM-5.1 malformed JSON in long contexts) — different root cause but same `finish_reason=length` family.
- #9344 (glm-5-turbo reasoning tokens exhaust output budget) — also missing budget propagation, different path.
