feat(config): support model.max_tokens override in config.yaml #18445
Closed
leon7609 wants to merge 1 commit into
Custom OpenAI-compatible providers (vLLM, llama.cpp, ollama, ...) that do not advertise a max_tokens default via /models cause responses to truncate with `finish_reason='length'` because the agent never sends a max_tokens hint. The chat_completions transport leaves max_tokens unset, the server uses its own (often conservative) default, and long answers get cut mid-sentence.

Mirror the existing `model.context_length` override pattern:
- Top-level `model.max_tokens` (preferred)
- Legacy `custom_providers.<>.models.<>.max_tokens` (per-model)

Read in AIAgent.__init__ right after the context_length resolution block; apply only when the constructor did not pass an explicit max_tokens.

No behaviour change for built-in providers (Anthropic / OpenAI / OpenRouter): they still resolve max_tokens via their own adapter logic.

Aligns with the schema proposed in upstream issue NousResearch#15037.
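To illustrate the precedence described above (an explicit constructor argument wins, the config value is only a fallback, and an unset value is never sent), here is a minimal sketch. `AgentSketch`, `config_max_tokens`, and `build_request` are hypothetical names used for illustration only; they are not the PR's code or the real `AIAgent` interface. The only outside fact relied on is that `max_tokens` is an optional field of an OpenAI-style chat completions request.

```python
from typing import Any, Dict, List, Optional


class AgentSketch:
    """Hypothetical stand-in for AIAgent; only max_tokens handling is shown."""

    def __init__(
        self,
        model: str,
        max_tokens: Optional[int] = None,
        config_max_tokens: Optional[int] = None,  # assumed: value read from config
    ) -> None:
        self.model = model
        self.max_tokens = max_tokens
        # Apply the config value only when the caller passed nothing explicit.
        if self.max_tokens is None and config_max_tokens is not None:
            self.max_tokens = config_max_tokens

    def build_request(self, messages: List[Dict[str, Any]]) -> Dict[str, Any]:
        """Build chat-completions style kwargs, omitting max_tokens when unset."""
        kwargs: Dict[str, Any] = {"model": self.model, "messages": messages}
        if self.max_tokens is not None:
            kwargs["max_tokens"] = self.max_tokens
        # When max_tokens is omitted, the server falls back to its own default,
        # which is the truncation scenario this PR targets.
        return kwargs
```

With no explicit argument and no config value, the request carries no `max_tokens` key at all, which matches the truncation case described in the commit message.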
Closing as superseded by upstream. v0.14.0 (v2026.5.16) ships native support. Verified locally on a 0.13→0.14 upgrade: dropped this patch, and upstream's native path covers the same config-driven override. Thanks for the consideration!
What does this PR do?
Adds a config-driven `max_tokens` override for OpenAI-compatible custom providers, mirroring the existing `model.context_length` lookup.

Custom providers (vLLM, llama.cpp, ollama, ...) that do not advertise a `max_tokens` default via `/models` cause responses to truncate with `finish_reason='length'` because the agent never sends a `max_tokens` hint. The `chat_completions` transport leaves `max_tokens` unset, the server uses its own (often conservative) default, and long answers get cut mid-sentence.

This PR introduces:
- A `get_max_tokens_from_config(model, base_url, config)` helper in `hermes_cli/config.py`, alongside the existing `get_custom_provider_context_length`.
- Code in `AIAgent.__init__` that calls the helper after the existing `context_length` resolution and applies its value to `self.max_tokens` only when the constructor didn't pass an explicit value.

Lookup order (first valid positive int wins):
1. `model.max_tokens` (top-level; applies to whichever model is active)
2. `custom_providers.<>.models.<model>.max_tokens` (per-model, scoped to the entry whose `base_url` matches)

No behaviour change for built-in providers (Anthropic / OpenAI / OpenRouter / Bedrock / Gemini): they still resolve `max_tokens` through their own adapter logic.

Aligns with the schema proposed in issue #15037.
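The lookup order can be sketched roughly as follows. This is an illustrative reconstruction from the description, not the code in this PR: the function name and signature come from the description, but the config layout (a `model` mapping and a `custom_providers` list of entries with `base_url` and `models` keys), the `_positive_int` helper, and the warning text are assumptions.

```python
import logging
from typing import Any, Optional

logger = logging.getLogger(__name__)


def _positive_int(value: Any) -> Optional[int]:
    """Return value if it is a positive int (bools excluded), else None."""
    if isinstance(value, int) and not isinstance(value, bool) and value > 0:
        return value
    return None


def get_max_tokens_from_config(
    model: str, base_url: Optional[str], config: Optional[dict] = None
) -> Optional[int]:
    """Resolve a max_tokens override; the first valid positive int wins."""
    config = config or {}

    # 1. Top-level model.max_tokens applies to whichever model is active.
    model_cfg = config.get("model")
    if isinstance(model_cfg, dict) and "max_tokens" in model_cfg:
        top = _positive_int(model_cfg["max_tokens"])
        if top is not None:
            return top
        # Invalid value: warn and fall through to the per-model lookup.
        logger.warning("Ignoring invalid model.max_tokens: %r", model_cfg["max_tokens"])

    # 2. Per-model value, scoped to the entry whose base_url matches.
    #    ASSUMPTION: custom_providers is a list of dicts shaped like
    #    {"base_url": ..., "models": {<model>: {"max_tokens": ...}}}.
    wanted = (base_url or "").rstrip("/")  # trailing-slash insensitive match
    if wanted:
        for entry in config.get("custom_providers") or []:
            if not isinstance(entry, dict):
                continue
            if (entry.get("base_url") or "").rstrip("/") != wanted:
                continue
            models = entry.get("models")
            per_model = models.get(model) if isinstance(models, dict) else None
            if isinstance(per_model, dict):
                found = _positive_int(per_model.get("max_tokens"))
                if found is not None:
                    return found

    # Neither location holds a valid value.
    return None
```

Excluding `bool` is a deliberate choice in this sketch, since `True` is an `int` in Python and would otherwise pass as `1`; whether the PR's helper does the same is not stated.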
Related Issue
Fixes the user-side workaround needed for #15037.
Type of Change
Changes Made
- `hermes_cli/config.py`: add `get_max_tokens_from_config(model, base_url, config=None) -> Optional[int]`:
  - Checks `model.max_tokens` first; logs a warning and falls through if non-int / non-positive.
  - Then checks `custom_providers.<>.models.<>.max_tokens` (URL match is trailing-slash insensitive, value must be a positive int).
  - Returns `None` when neither location holds a valid value.
- `run_agent.py`, `AIAgent.__init__`: after the existing `context_length` resolution, call the helper and apply the value to `self.max_tokens` if the constructor didn't pass one.
- `tests/hermes_cli/test_max_tokens_config.py`: 15 unit tests covering top-level / per-model / precedence / edge cases / providers-dict (v12+) form.

How to Test
1. Add a `max_tokens` override of 32768 to `~/.hermes/config.yaml`.
2. Without the patch, long responses truncate with `finish_reason='length'`.
3. With the patch, the request is sent with `max_tokens=32768`, and the response runs to a natural stop.

Or run the unit tests in `tests/hermes_cli/test_max_tokens_config.py`.
Checklist
Code
- Commit title uses the `feat(config): ...` form
- Ran `pytest tests/ -q`, and the changes don't introduce new failures (vs. main baseline)

Documentation & Housekeeping
- `model.max_tokens` is opt-in
- `context_length` semantics unchanged