feat(config): support model.max_tokens override in config.yaml #18445

Closed
leon7609 wants to merge 1 commit into NousResearch:main from leon7609:feat/config-max-tokens-override

@leon7609 leon7609 commented May 1, 2026

Contributor

What does this PR do?

Adds a config-driven max_tokens override for OpenAI-compatible custom providers, mirroring the existing model.context_length lookup.

Custom providers (vLLM, llama.cpp, ollama, ...) that do not advertise a max_tokens default via /models cause responses to truncate with finish_reason='length' because the agent never sends a max_tokens hint. The chat_completions transport leaves max_tokens unset, the server uses its own (often conservative) default, and long answers get cut mid-sentence.

This PR introduces:

  1. A new get_max_tokens_from_config(model, base_url, config) helper in hermes_cli/config.py — alongside the existing get_custom_provider_context_length.
  2. A wire-up in AIAgent.__init__ that calls the helper after the existing context_length resolution and applies its value to self.max_tokens only when the constructor didn't pass an explicit value.

Lookup order (first valid positive int wins):

  • model.max_tokens (top-level — applies to whichever model is active)
  • custom_providers.<>.models.<model>.max_tokens (per-model, scoped to the entry whose base_url matches)
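
The lookup order above can be sketched roughly as follows. This is an illustration, not the PR's actual helper: the name `lookup_max_tokens_sketch`, the list-of-dicts shape assumed for `custom_providers`, and the exclusion of booleans are all assumptions made for the example.

```python
from typing import Any, Optional


def _positive_int(value: Any) -> Optional[int]:
    """Return value if it is a positive int, else None.

    bool is a subclass of int in Python, so it is rejected explicitly
    (an assumption of this sketch).
    """
    if isinstance(value, bool) or not isinstance(value, int):
        return None
    return value if value > 0 else None


def lookup_max_tokens_sketch(model: str, base_url: str, config: dict) -> Optional[int]:
    """Hypothetical lookup: first valid positive int wins."""
    # 1. Top-level model.max_tokens applies to whichever model is active.
    model_cfg = config.get("model")
    if isinstance(model_cfg, dict):
        top_level = _positive_int(model_cfg.get("max_tokens"))
        if top_level is not None:
            return top_level

    # 2. Per-model value, scoped to the provider entry whose base_url
    #    matches (compared without trailing slashes).
    target = base_url.rstrip("/")
    for entry in config.get("custom_providers") or []:
        if not isinstance(entry, dict):
            continue
        if str(entry.get("base_url", "")).rstrip("/") != target:
            continue
        models = entry.get("models")
        per_model = models.get(model) if isinstance(models, dict) else None
        if isinstance(per_model, dict):
            found = _positive_int(per_model.get("max_tokens"))
            if found is not None:
                return found

    # Neither location holds a valid value.
    return None
```

An invalid top-level value (for example a negative number) falls through to the per-model entry rather than aborting the lookup.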

No behaviour change for built-in providers (Anthropic / OpenAI / OpenRouter / Bedrock / Gemini) — they still resolve max_tokens through their own adapter logic.

Aligns with the schema proposed in issue #15037.

Related Issue

Fixes the user-side workaround needed for #15037.

Type of Change

  • ✨ New feature (non-breaking change that adds functionality)

Changes Made

  • hermes_cli/config.py — add get_max_tokens_from_config(model, base_url, config=None) -> Optional[int]:
    • Reads top-level model.max_tokens first; logs a warning and falls through if non-int / non-positive.
    • Falls back to custom_providers.<>.models.<>.max_tokens (URL match is trailing-slash insensitive, value must be a positive int).
    • Returns None when neither location holds a valid value.
  • run_agent.py (AIAgent.__init__): after the existing context_length resolution, call the helper and apply the value to self.max_tokens if the constructor didn't pass one.
  • tests/hermes_cli/test_max_tokens_config.py — 15 unit tests covering top-level / per-model / precedence / edge cases / providers-dict (v12+) form.
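
The wire-up rule ("apply the config value only when the constructor didn't pass one") can be illustrated with a stripped-down stand-in. `SketchAgent` and its `config_lookup` parameter are hypothetical and exist only for this example; the real AIAgent constructor takes many more arguments.

```python
from typing import Callable, Optional


class SketchAgent:
    """Hypothetical stand-in for the agent, showing only max_tokens precedence."""

    def __init__(
        self,
        model: str,
        base_url: str,
        max_tokens: Optional[int] = None,
        config_lookup: Optional[Callable[[str, str], Optional[int]]] = None,
    ) -> None:
        self.model = model
        self.base_url = base_url
        # An explicit constructor argument always wins.
        self.max_tokens = max_tokens
        # Otherwise fall back to the config-driven value, if any.
        if self.max_tokens is None and config_lookup is not None:
            self.max_tokens = config_lookup(model, base_url)


# Usage: the lambda stands in for a config lookup that found 32768.
agent = SketchAgent("m", "http://localhost:8080/v1", config_lookup=lambda m, u: 32768)
```

When neither source supplies a value, `max_tokens` stays `None`, which matches the behaviour before this PR.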

How to Test

  1. Configure a custom provider in ~/.hermes/config.yaml:
    model:
      default: qwen3.6-27b-fp8
      provider: custom
      base_url: http://192.168.x.x:8080/v1
      max_tokens: 32768
  2. Send a prompt likely to produce a long response (e.g. "summarise this 50K-word document").
  3. Before this fix: the response gets cut off at the server's default cap (often 2K–4K) with finish_reason='length'.
  4. After this fix: the agent sends max_tokens=32768, and the response runs to a natural stop.

Or run the unit tests:

pytest tests/hermes_cli/test_max_tokens_config.py -v
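
For a manual check of steps 3 and 4, the difference shows up in the request body and in the response's finish_reason. The two helpers below are a minimal sketch with hypothetical names; they assume the standard OpenAI-compatible chat completions shape (a `max_tokens` request field and `choices[].finish_reason` in the response) and are not the transport code from this repository.

```python
from typing import Any, Dict, List, Optional


def build_chat_payload(
    model: str,
    messages: List[Dict[str, str]],
    max_tokens: Optional[int] = None,
) -> Dict[str, Any]:
    """Build a chat completions request body.

    max_tokens is included only when set; leaving it out lets the
    server fall back to its own default cap.
    """
    payload: Dict[str, Any] = {"model": model, "messages": messages}
    if max_tokens is not None:
        payload["max_tokens"] = max_tokens
    return payload


def was_truncated(response: Dict[str, Any]) -> bool:
    """Return True if any choice stopped because it hit the token cap."""
    choices = response.get("choices") or []
    return any(choice.get("finish_reason") == "length" for choice in choices)
```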

Checklist

Code

Documentation & Housekeeping

  • No breaking change to the schema — model.max_tokens is opt-in
  • Existing context_length semantics unchanged

Custom OpenAI-compatible providers (vLLM, llama.cpp, ollama, ...) that
do not advertise a max_tokens default via /models cause responses to
truncate with `finish_reason='length'` because the agent never sends
a max_tokens hint.  The chat_completions transport leaves max_tokens
unset, the server uses its own (often conservative) default, and long
answers get cut mid-sentence.

Mirror the existing `model.context_length` override pattern:

  - Top-level `model.max_tokens` (preferred)
  - Legacy `custom_providers.<>.models.<>.max_tokens` (per-model)

Read in AIAgent.__init__ right after the context_length resolution
block; apply only when the constructor did not pass an explicit
max_tokens.  No behaviour change for built-in providers (Anthropic /
OpenAI / OpenRouter) — they still resolve max_tokens via their own
adapter logic.

Aligns with the schema proposed in upstream issue NousResearch#15037.
@alt-glitch added the type/feature, P3, comp/agent, and area/config labels on May 1, 2026
@leon7609

Contributor Author

Closing as superseded by upstream.

v0.14.0 (v2026.5.16) ships native model.max_tokens support in run_agent.py via commit a78e622 ("fix(agent): honor configured model max tokens"). It reads model.max_tokens from config.yaml with bool / non-positive / int-parse validation and an invalid-value warning — functionally equivalent to this PR's intent (which was aligned with #15037). Carrying a parallel get_max_tokens_from_config helper would only duplicate the behavior and conflict on rebase.
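
For readers unfamiliar with that style of validation, the checks named above (bool, non-positive, int-parse, plus a warning) might look something like the sketch below. This is not the code from commit a78e622; the function name `parse_max_tokens` and the warning text are assumptions made for illustration.

```python
import logging
from typing import Any, Optional

logger = logging.getLogger(__name__)


def parse_max_tokens(raw: Any) -> Optional[int]:
    """Hypothetical validator for a configured max_tokens value."""
    if raw is None:
        # Key not set: nothing to warn about.
        return None
    if isinstance(raw, bool):
        # YAML `true`/`false` would otherwise parse as 1/0.
        logger.warning("Ignoring invalid model.max_tokens: %r", raw)
        return None
    try:
        value = int(raw)
    except (TypeError, ValueError, OverflowError):
        logger.warning("Ignoring invalid model.max_tokens: %r", raw)
        return None
    if value <= 0:
        logger.warning("Ignoring non-positive model.max_tokens: %r", raw)
        return None
    return value
```

Under these assumptions a quoted value such as "32768" is accepted, while true, 0, and "abc" are rejected with a warning.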

Verified locally on a 0.13→0.14 upgrade: dropped this patch, upstream's native path covers the same config-driven override. Thanks for the consideration!

@leon7609 leon7609 closed this May 19, 2026