feat(custom_providers): per-model max_tokens with switch/fallback re-resolution by banditburai · Pull Request #35518 · NousResearch/hermes-agent

banditburai · 2026-05-30T20:33:17Z

Honor custom_providers[].models.<model>.max_tokens end-to-end so a per-model output cap is never silently dropped (truncation) and never leaks onto another model.

Problem

A configured max_tokens had no effect — output silently truncated on Ollama Cloud / zai / OpenAI-compatible proxies. This is not a hard 4096 floor: the only literal 4096 fallback on the request path is on the bedrock call (agent.max_tokens or 4096 at chat_completion_helpers.py:563, consumed by transports/bedrock.py:58). The real symptom is that the field was never set, so each provider fell back to its own small default ceiling.

Root cause

A propagation gap: the CLI never passed max_tokens= into the AIAgent constructor, so agent.max_tokens stayed None. With None, providers applied their own default → truncation. Omitting max_tokens when nothing is configured is the desired behavior — the model's native output ceiling is used (anthropic_adapter.py:2072); the bug was that a configured value never reached agent.max_tokens. The fix mirrors the existing context_length resolve/propagate machinery.

Solution (4 core files)

Reader — new get_custom_provider_max_tokens (hermes_cli/config.py:3662), sibling of get_custom_provider_context_length: trailing-slash-insensitive base_url match, positive-int only, bool rejected (it is an int subclass but never a valid cap).
Resolver — new _resolve_max_tokens (agent/agent_init.py:136), called at agent_init.py:1452 after _custom_providers is resolved (:1394) so the per-model value can win. The old inline global-only block (~:1352) is deleted and subsumed with identical int/bool/<=0 validation and warning text — no behavior loss on the global path. Persists _config_max_tokens and _max_tokens_explicit; both keys are seeded in the _primary_runtime base literal (:1700).
Reaches the wire (proof, not a dangling field) — agent.max_tokens is the single field consumed across every transport: chat_completions transports/chat_completions.py:298,486 (guarded is not None), anthropic anthropic_adapter.py:2168, bedrock transports/bedrock.py:58, direct-OpenAI/o1 mapped to max_completion_tokens run_agent.py:1206.

Precedence

explicit (constructor/CLI) > per-model custom_providers > global model.max_tokens > None (provider native ceiling)

Precedence inversion (deliberate): per-model is checked before the global value — the opposite of context_length (global-first). A provider-specific output cap is the more-specific intent and must win. Blast radius is contained: context_length precedence is untouched, and _KNOWN_KEYS is unchanged (an entry-level max_tokens still fires the existing "unknown config keys ignored" warning; the supported location is models.<model>.max_tokens, nested-only by design).

Initial resolve + 3 re-resolution seams

The config-resolved cap is null-and-re-resolved against the active model at each transition, faithfully mirroring _config_context_length:

switch_model — re-resolves from live load_config(); the global never re-applies (no leak). Snapshotted before the try (agent_runtime_helpers.py:1409-1410), nulled inside the try (:1421), re-resolved (:1532-1547), and rolled back on a failed client build (:1505-1513).
try_activate_fallback — re-resolves from the cached agent._custom_providers (chat_completion_helpers.py:1138,1148-1158).
restore_primary_runtime — tolerant rt.get reapply of the primary snapshot (agent_runtime_helpers.py:957-959).

Leak guarantees (#28782 — both directions)

Forward (global → fallback): the global cap belongs to the primary only and is nulled on switch/fallback, so it never leaks onto a fallback provider.
Backward (fallback → restored primary): a fallback's re-resolved cap never sticks to the restored primary — restore_primary_runtime reapplies the primary's snapshot.
Explicit values survive every transition, gated by _max_tokens_explicit at every mutation site (switch null + re-resolve, fallback null + re-resolve, restore reapply) — a user's --max_tokens is never clobbered.

Testing

pytest tests/hermes_cli/test_custom_provider_max_tokens.py \
       tests/run_agent/test_fallback_max_tokens.py \
       tests/run_agent/test_switch_model_max_tokens.py        # 138 passed
ruff check hermes_cli/config.py agent/agent_init.py \
           agent/agent_runtime_helpers.py agent/chat_completion_helpers.py tests/   # clean

A broader sweep (test_primary_runtime_restore.py, test_switch_model_context.py + the two new files) is 42 passed / 1 failed; the one failure is test_primary_runtime_restore.py::TestTryRecoverPrimaryTransport::test_allowed_for_anthropic_direct — an environmental anthropic lazy-install ImportError (anthropic_adapter.py:680) on an anthropic-only path this diff does not touch. It fails on baseline too — not a regression.

Coverage

Resolver: matching / non-matching base_url, missing model, numeric-string coercion, zero/negative, non-numeric, bool rejection, trailing-slash (both directions), empty inputs.
Precedence / config shape: per-model lands, per-model beats global, global fallback, explicit wins, bare-string model: config, both invalid-value warning paths.
Lifecycle (all mutation sites): switch (re-resolve + drop-global + explicit-survive + failed-swap rollback), fallback (re-resolve + drop-global + explicit-survive), restore (reapply + explicit-preserve).
Tests assert behavior not implementation; zero time/random/flaky dependencies. Switch tests mock the resolver (seam isolation); fallback tests use the real resolver (integration) — deliberately complementary.

Known behaviors (disclosed)

Two deliberate asymmetries cloned verbatim from the existing _config_context_length machinery: switch_model re-resolves from live config but does not write back agent._custom_providers, so a switch→fallback uses the startup snapshot (fix-both-or-neither, intentionally not diverging here); the fallback path has no try/except rollback (the same pre-existing exposure already present for model/provider/_config_context_length, not new surface).
Cosmetic: the warning-only _raw_per_model_max_tokens returns on the first base_url match rather than full-scanning, so a pathological duplicate-base_url config could miss (never mis-emit) an invalid-value warning. Low severity, untested.
_primary_runtime is never serialized (pure in-memory, rebuilt by init_agent on every start), so the rt.get → None tolerance on restore is defensive only and never bites a real upgraded session.

Issue / PR linkage

Closes #15037, #20004, #28046, #28782
Addresses #20741
Supersedes #28142, #28154, #28786; supersedes the max_tokens slice of #28995

Closes — feat(config): support per-model max_tokens in custom_providers config #15037 (the feature itself), max_tokens config from custom_providers is not passed to AIAgent #20004 (custom_providers max_tokens now reaches AIAgent — the exact net-new behavior of this diff), Bug: max_tokens not read from custom_providers per-model config, always defaults to 4096 #28046 (the always-4096 default is verifiably gone), feat: custom_providers should support per-provider max_tokens override #28782 (per-model override in custom_providers + both leak directions fully owned).
Addresses, not Closes — Bug: max_tokens from config.yaml is silently ignored — never propagated to AIAgent, causing output truncation on Ollama Cloud / zai / custom endpoints #20741: its stated root cause is the global config.yaml model.max_tokens path, which this diff preserves (relocates) rather than introduces — that path already propagated on baseline. The diff does fix the custom-endpoint truncation it reports (via the per-model read), but does not exclusively own the issue's global-config framing, so closure is left conservative.
Supersedes — fix(agent_init): read max_tokens from custom_providers per-model config #28142 / fix(agent): honor custom provider max_tokens #28154 / feat: per-provider max_tokens via custom_providers models.<model>.max_tokens #28786 are a clean strict superset (same per-model read via the shared resolver, plus bool rejection and switch/fallback/restore re-resolution none of them had); only the max_tokens slice of feat: Config-Runtime Contract Registry (Phase 1) + fix #28046 + fix #28863 #28995 is superseded.

Not in scope (tracked separately)

terminal.docker_extra_args from config.yaml silently dropped — missing in _terminal_env_map bridge #28863 — terminal.docker_extra_args missing from gateway/run.py _terminal_env_map (a one-line gateway fix feat: Config-Runtime Contract Registry (Phase 1) + fix #28046 + fix #28863 #28995 happened to bundle). Not included here; still open; will be filed as a separate follow-up. This PR does not claim Closes #28863.
[Feature]: Typed Config-Runtime Contract — eliminate silent config/state/hook binding gaps #28984 — the Config-Runtime Contract Registry (the other half of feat: Config-Runtime Contract Registry (Phase 1) + fix #28046 + fix #28863 #28995) is an independent feature and is intentionally not carried here.

Sibling of get_custom_provider_context_length: reads per-model custom_providers[].models.<model>.max_tokens (positive int, bool-rejecting, trailing-slash-insensitive base_url match). No call sites yet. Refs NousResearch#28046.

Resolve agent.max_tokens with precedence explicit > per-model custom_providers[].models.<m>.max_tokens > global model.max_tokens. Per-model wins over global (specificity); context_length is unchanged. Persist _config_max_tokens / _max_tokens_explicit for switch/fallback re-resolution. Refs NousResearch#28046, NousResearch#15037, NousResearch#28782, NousResearch#20741.

Null-and-re-resolve the config-derived max_tokens against the new model on /model switch and fallback activation, mirroring _config_context_length, and add it to the switch rollback snapshot. The global model.max_tokens applies to the primary model only and no longer leaks onto fallbacks; explicit constructor values survive switches. Also harden _resolve_max_tokens against a bare-string model: config (surfaced by the integration tests). Refs NousResearch#28782, NousResearch#15037.

Carry max_tokens / _config_max_tokens in _primary_runtime and reapply them in restore_primary_runtime so a fallback's re-resolved cap doesn't persist onto the recovered primary model. Completes the no-leak guarantee end-to-end. Closes NousResearch#15037, NousResearch#28046, NousResearch#28782. Addresses NousResearch#20741, NousResearch#20004.

alt-glitch · 2026-05-30T20:45:56Z

Superset of the existing max_tokens cluster: #20149, #28142, #28154, #28786. Adds switch_model/fallback re-resolution and bool rejection that earlier PRs lack. Related issues: #15037, #20004, #20741, #28046, #28782. Also overlaps with #29705 (per-model max_tokens overlay for built-in providers).

tonydwb

Code Review: feat(custom_providers): per-model max_tokens with switch/fallback re-resolution

Author: banditburai
Branch: worktree-fix+custom-provider-max-tokens → main
Files changed: 7 files (+490 / -26)
headRefOid: a22369c

Overview

This PR propagates a per-model max_tokens output cap from custom_providers[].models.<model>.max_tokens to agent.max_tokens, fixing a long-standing bug where a configured max_tokens never reached the request wire, causing silent truncation on Ollama Cloud, zai, and OpenAI-compatible proxies.

The solution mirrors the existing context_length machinery with careful re-resolution across three mid-session seams (switch_model, try_activate_fallback, restore_primary_runtime), plus leak guarantees in both directions (#28782).

Correctness (9/10)

Precedence is clearly documented and correct: explicit > per-model custom_providers > global model.max_tokens > None. The deliberate inversion (per-model checked before global, opposite of context_length) makes semantic sense — a provider-specific output cap should win.
Re-resolution at all three seams (switch, fallback, restore) faithfully mirrors the _config_context_length pattern.
Leak guarantees are implemented symmetrically: the global is nulled on switch/fallback (forward), and restore_primary_runtime reapplies the primary's snapshot (backward). Explicit values are gated by _max_tokens_explicit at every mutation site.

Minor concern: The _raw_per_model_max_tokens() fallback warning-path function in _resolve_max_tokens scans for the first matching base_url and returns early (return inside a for+continue block). If a user has duplicate base_url entries in custom_providers, an invalid value on the second duplicate would never produce a warning. The PR doc explicitly acknowledges this as low-severity and untested. Acceptable for v1, but worth flagging.

Security (10/10)

No new network exposure, shell execution paths, or unsafe deserialization.
base_url matching is trailing-slash-insensitive (safe normalization).
bool is properly rejected (it's an int subclass but never a valid cap — prevents True → 1 token truncation).
int() coercion with explicit TypeError/ValueError handling — no silent overflow or injection risk.
No changes to authentication or credential handling.

Code Quality (9/10)

Excellent docstrings — every new function has a clear docstring describing purpose, precedence, and behavior.
Inline comments are thorough and explanatory, not redundant.
Consistent naming: _resolve_max_tokens parallels _config_context_length; get_custom_provider_max_tokens parallels get_custom_provider_context_length.
Defensive programming: getattr(agent, "_custom_providers", None) or [], rt.get() tolerance on restore, _ra().logger.warning + print(stderr) dual warning (matching existing pattern).
Diff is minimal and focused — exactly what's needed, no scope creep.

Nitpicks:

_raw_per_model_max_tokens() is defined as a nested function inside _resolve_max_tokens() with a very similar purpose to get_custom_provider_max_tokens(). The duplication is intentional (one is the validation/warning path, the other is the clean resolver), but it would be cleaner to share the lookup logic. The PR doc justifies this: the warning-only path returns on first match rather than full-scanning, which is a slightly different contract. Still, a shared _find_custom_provider_model_cfg(base_url, model, custom_providers) helper could reduce the code duplication.
The except Exception catch-alls on the resolver calls (e.g., try: from hermes_cli.config import get_custom_provider_max_tokens ... except Exception: cp_mt = None) are a bit broad. These are lazy imports and config reads — ImportError and KeyError/TypeError are expected; KeyboardInterrupt or SystemExit would be swallowed. For a cron/code-review context this is fine (existing code does the same pattern), but ideally catch specific exceptions.

Testing (10/10)

138 new tests across 3 test files — excellent coverage.
Unit tests for get_custom_provider_max_tokens(): matching/non-matching base_url, missing model, numeric-string coercion, zero/negative, non-numeric, bool rejection, trailing-slash (both directions), empty inputs. All parameterized where appropriate.
Precedence tests for _resolve_max_tokens(): per-model lands, per-model beats global, global fallback, explicit wins, invalid values warn, bare-string model config tolerated.
Lifecycle tests (switch, fallback, restore): explicit survive, global dropped, per-model re-resolved, rollback on failed switch. Comprehensive.
Clean assertions — tests assert behavior, not implementation. No time/random/flaky dependencies.
The PR author notes 138/138 tests pass on these files, and the one unrelated failure is an environmental anthropic import error that also fails on baseline.

Performance (10/10)

Config read is at init time only — no per-request overhead.
Re-resolution on switch/fallback involves load_config() (cached by Python YAML loader) plus a simple linear scan of custom_providers — typically <1ms.
_primary_runtime snapshot stores max_tokens as a plain Python int/none field — negligible memory impact on a dataclass that already stores 20+ fields.

Summary

Category	Score
Correctness	9/10
Security	10/10
Code Quality	9/10
Testing	10/10
Performance	10/10

Decision: APPROVE (minor suggestions only, no blockers)

Suggested Follow-ups (not blocking)

Consider extracting a shared _find_custom_provider_model_cfg(base_url, model, custom_providers) helper to eliminate the structural duplication between get_custom_provider_max_tokens() and _raw_per_model_max_tokens() / the context_length warning block.
Narrow the except Exception wrappers to (ImportError, TypeError, KeyError) where feasible.
Track the base_url duplicate warning gap as a follow-up issue for completeness.

Closes #15037, #20004, #28046, #28782. Well-structured, well-tested, minimal blast radius.

banditburai added 4 commits May 30, 2026 16:28

feat(config): add get_custom_provider_max_tokens resolver

154f934

Sibling of get_custom_provider_context_length: reads per-model custom_providers[].models.<model>.max_tokens (positive int, bool-rejecting, trailing-slash-insensitive base_url match). No call sites yet. Refs NousResearch#28046.

alt-glitch added type/feature New feature or request P2 Medium — degraded but workaround exists comp/agent Core agent loop, run_agent.py, prompt builder comp/cli CLI entry point, hermes_cli/, setup wizard area/config Config system, migrations, profiles labels May 30, 2026

tonydwb approved these changes May 30, 2026

View reviewed changes

teknium1 mentioned this pull request Jun 5, 2026

fix(gateway): propagate max_tokens from config.yaml to AIAgent (#20741) #39864

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(custom_providers): per-model max_tokens with switch/fallback re-resolution#35518

feat(custom_providers): per-model max_tokens with switch/fallback re-resolution#35518
banditburai wants to merge 4 commits into
NousResearch:mainfrom
banditburai:worktree-fix+custom-provider-max-tokens

banditburai commented May 30, 2026

Uh oh!

alt-glitch commented May 30, 2026

Uh oh!

tonydwb left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

banditburai commented May 30, 2026

Problem

Root cause

Solution (4 core files)

Precedence

Initial resolve + 3 re-resolution seams

Leak guarantees (#28782 — both directions)

Testing

Coverage

Known behaviors (disclosed)

Issue / PR linkage

Not in scope (tracked separately)

Uh oh!

alt-glitch commented May 30, 2026

Uh oh!

tonydwb left a comment

Choose a reason for hiding this comment

Code Review: feat(custom_providers): per-model max_tokens with switch/fallback re-resolution

Overview

Correctness (9/10)

Security (10/10)

Code Quality (9/10)

Testing (10/10)

Performance (10/10)

Summary

Suggested Follow-ups (not blocking)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants