Bug: max_tokens not read from custom_providers per-model config, always defaults to 4096 #28046

## Bug Description

When using a custom provider (e.g., xfyun) with a per-model `max_tokens` configured under `custom_providers[].models.<model>.max_tokens`, Hermes **ignores** this value and always defaults to **4096** output tokens.

## Root Cause

In `run_agent.py`, `context_length` is correctly read from `custom_providers` per-model config on startup (lines 1896-1946), but there is **no equivalent code for `max_tokens`**.

The constructor sets `self.max_tokens = max_tokens`, where the parameter defaults to `None` (line 1208). Nothing later populates it from the config, so the API call evaluates `self.max_tokens or 4096` (line 8295) and sends 4096.
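The fallback can be illustrated in isolation. This is a minimal sketch, not code from `run_agent.py`: the function name `effective_max_tokens` and the constant `DEFAULT_MAX_TOKENS` are hypothetical, and only the `or 4096` expression is taken from the issue.

```python
from typing import Optional

DEFAULT_MAX_TOKENS = 4096  # fallback value reported in this issue


def effective_max_tokens(max_tokens: Optional[int]) -> int:
    """Mirror the `self.max_tokens or 4096` fallback (hypothetical helper)."""
    return max_tokens or DEFAULT_MAX_TOKENS


# max_tokens is never populated from custom_providers, so it stays None
# and the request is sent with the 4096 default.
print(effective_max_tokens(None))   # 4096
print(effective_max_tokens(32000))  # 32000
```

Because `or` treats every falsy value the same way, a value of `0` would also fall through to 4096.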

## Steps to Reproduce

1. Configure a custom provider with per-model `max_tokens`:

```yaml
custom_providers:
  - name: xfyun
    base_url: https://maas-coding-api.cn-huabei-1.xf-yun.com/v2
    api_key: ${API_KEY}
    api_mode: chat_completions
    model: astron-code-latest
    models:
      astron-code-latest:
        context_length: 200000
        max_tokens: 32000
        reasoning: true
```

2. Start a session with `gateway run --replace`
3. Ask the agent to generate a long response
4. Observe `Response truncated (finish_reason='length') - model hit max output tokens` in the gateway log, with output capped at ~4096 tokens
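To confirm the truncation outside the gateway log, the raw response can be checked for `finish_reason == "length"`. The sketch below assumes the endpoint returns an OpenAI-style chat completion payload already parsed into a dict; `is_truncated` is a hypothetical helper, not part of Hermes.

```python
from typing import Any, Dict


def is_truncated(response: Dict[str, Any]) -> bool:
    """Return True if any choice stopped because it hit the output token limit."""
    choices = response.get("choices") or []
    return any(
        isinstance(choice, dict) and choice.get("finish_reason") == "length"
        for choice in choices
    )


# Example payloads, trimmed to the one field that matters here.
print(is_truncated({"choices": [{"finish_reason": "length"}]}))  # True
print(is_truncated({"choices": [{"finish_reason": "stop"}]}))    # False
```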

## Expected Behavior

Hermes should read `max_tokens` from `custom_providers[].models.<model>.max_tokens` (when present and valid) and use it as the output token limit, just as it already does for `context_length`.

## Fix (tested and working)

Insert this block after the existing `context_length` custom_providers lookup in `run_agent.py` (after the `_ensure_lmstudio_runtime_loaded` call):

```python
# Also read max_tokens from custom_providers per-model config
if self.max_tokens is None and _custom_providers:
    _target = self.base_url.rstrip("/") if self.base_url else ""
    for _cp_entry in _custom_providers:
        if not isinstance(_cp_entry, dict):
            continue
        _cp_url = (_cp_entry.get("base_url") or "").rstrip("/")
        if _target and _cp_url == _target:
            _cp_models = _cp_entry.get("models", {})
            if isinstance(_cp_models, dict):
                _cp_model_cfg = _cp_models.get(self.model, {})
                if isinstance(_cp_model_cfg, dict):
                    _cp_mt = _cp_model_cfg.get("max_tokens")
                    if _cp_mt is not None:
                        try:
                            _parsed_mt = int(_cp_mt)
                            if _parsed_mt > 0:
                                self.max_tokens = _parsed_mt
                        except (TypeError, ValueError):
                            pass
            break
```
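For unit testing, the same lookup can be expressed as a standalone function. This is a sketch under assumptions: `resolve_custom_max_tokens` is a hypothetical name, and the function takes the provider list, base URL, and model as arguments instead of reading `self`. It mirrors the patch above, including stopping at the first provider whose `base_url` matches.

```python
from typing import Any, Optional


def resolve_custom_max_tokens(
    custom_providers: Any, base_url: Optional[str], model: str
) -> Optional[int]:
    """Return the per-model max_tokens for the matching provider, or None."""
    target = (base_url or "").rstrip("/")
    if not target or not isinstance(custom_providers, list):
        return None
    for entry in custom_providers:
        if not isinstance(entry, dict):
            continue
        if (entry.get("base_url") or "").rstrip("/") != target:
            continue
        # First URL match decides the result, like the `break` in the patch.
        models = entry.get("models")
        model_cfg = models.get(model) if isinstance(models, dict) else None
        raw = model_cfg.get("max_tokens") if isinstance(model_cfg, dict) else None
        try:
            parsed = int(raw)
        except (TypeError, ValueError):
            return None  # missing or non-numeric value: keep the default
        return parsed if parsed > 0 else None
    return None


providers = [
    {
        "name": "xfyun",
        "base_url": "https://maas-coding-api.cn-huabei-1.xf-yun.com/v2",
        "models": {
            "astron-code-latest": {"context_length": 200000, "max_tokens": 32000}
        },
    }
]
url = "https://maas-coding-api.cn-huabei-1.xf-yun.com/v2/"  # trailing slash is ignored
print(resolve_custom_max_tokens(providers, url, "astron-code-latest"))  # 32000
print(resolve_custom_max_tokens(providers, url, "unknown-model"))       # None
```

Returning `None` for missing, non-numeric, or non-positive values leaves the existing 4096 fallback in place, which matches the behavior of the patch.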

## Related Issues

- Issue #8550: Similar problem for compression model context_length (already fixed)
- Issue #15779: Similar problem for /model switch context_length resolution

## Environment

- Hermes version: v0.10.0+ (config_version 23)
- Profile: custom profile with custom_providers
- Provider: OpenAI-compatible custom endpoint

