Skip to content

feat(custom_providers): per-model max_tokens with switch/fallback re-resolution#35518

Open
banditburai wants to merge 4 commits into
NousResearch:mainfrom
banditburai:worktree-fix+custom-provider-max-tokens
Open

feat(custom_providers): per-model max_tokens with switch/fallback re-resolution#35518
banditburai wants to merge 4 commits into
NousResearch:mainfrom
banditburai:worktree-fix+custom-provider-max-tokens

Conversation

@banditburai

Copy link
Copy Markdown
Contributor

Honor custom_providers[].models.<model>.max_tokens end-to-end so a per-model output cap is never silently dropped (truncation) and never leaks onto another model.

Problem

A configured max_tokens had no effect — output silently truncated on Ollama Cloud / zai / OpenAI-compatible proxies. This is not a hard 4096 floor: the only literal 4096 fallback on the request path is on the bedrock call (agent.max_tokens or 4096 at chat_completion_helpers.py:563, consumed by transports/bedrock.py:58). The real symptom is that the field was never set, so each provider fell back to its own small default ceiling.

Root cause

A propagation gap: the CLI never passed max_tokens= into the AIAgent constructor, so agent.max_tokens stayed None. With None, providers applied their own default → truncation. Omitting max_tokens when nothing is configured is the desired behavior — the model's native output ceiling is used (anthropic_adapter.py:2072); the bug was that a configured value never reached agent.max_tokens. The fix mirrors the existing context_length resolve/propagate machinery.

Solution (4 core files)

  • Reader — new get_custom_provider_max_tokens (hermes_cli/config.py:3662), sibling of get_custom_provider_context_length: trailing-slash-insensitive base_url match, positive-int only, bool rejected (it is an int subclass but never a valid cap).
  • Resolver — new _resolve_max_tokens (agent/agent_init.py:136), called at agent_init.py:1452 after _custom_providers is resolved (:1394) so the per-model value can win. The old inline global-only block (~:1352) is deleted and subsumed with identical int/bool/<=0 validation and warning text — no behavior loss on the global path. Persists _config_max_tokens and _max_tokens_explicit; both keys are seeded in the _primary_runtime base literal (:1700).
  • Reaches the wire (proof, not a dangling field) — agent.max_tokens is the single field consumed across every transport: chat_completions transports/chat_completions.py:298,486 (guarded is not None), anthropic anthropic_adapter.py:2168, bedrock transports/bedrock.py:58, direct-OpenAI/o1 mapped to max_completion_tokens run_agent.py:1206.

Precedence

explicit (constructor/CLI) > per-model custom_providers > global model.max_tokens > None (provider native ceiling)

Precedence inversion (deliberate): per-model is checked before the global value — the opposite of context_length (global-first). A provider-specific output cap is the more-specific intent and must win. Blast radius is contained: context_length precedence is untouched, and _KNOWN_KEYS is unchanged (an entry-level max_tokens still fires the existing "unknown config keys ignored" warning; the supported location is models.<model>.max_tokens, nested-only by design).

Initial resolve + 3 re-resolution seams

The config-resolved cap is null-and-re-resolved against the active model at each transition, faithfully mirroring _config_context_length:

  • switch_model — re-resolves from live load_config(); the global never re-applies (no leak). Snapshotted before the try (agent_runtime_helpers.py:1409-1410), nulled inside the try (:1421), re-resolved (:1532-1547), and rolled back on a failed client build (:1505-1513).
  • try_activate_fallback — re-resolves from the cached agent._custom_providers (chat_completion_helpers.py:1138,1148-1158).
  • restore_primary_runtime — tolerant rt.get reapply of the primary snapshot (agent_runtime_helpers.py:957-959).

Leak guarantees (#28782 — both directions)

  • Forward (global → fallback): the global cap belongs to the primary only and is nulled on switch/fallback, so it never leaks onto a fallback provider.
  • Backward (fallback → restored primary): a fallback's re-resolved cap never sticks to the restored primary — restore_primary_runtime reapplies the primary's snapshot.
  • Explicit values survive every transition, gated by _max_tokens_explicit at every mutation site (switch null + re-resolve, fallback null + re-resolve, restore reapply) — a user's --max_tokens is never clobbered.

Testing

pytest tests/hermes_cli/test_custom_provider_max_tokens.py \
       tests/run_agent/test_fallback_max_tokens.py \
       tests/run_agent/test_switch_model_max_tokens.py        # 138 passed
ruff check hermes_cli/config.py agent/agent_init.py \
           agent/agent_runtime_helpers.py agent/chat_completion_helpers.py tests/   # clean

A broader sweep (test_primary_runtime_restore.py, test_switch_model_context.py + the two new files) is 42 passed / 1 failed; the one failure is test_primary_runtime_restore.py::TestTryRecoverPrimaryTransport::test_allowed_for_anthropic_direct — an environmental anthropic lazy-install ImportError (anthropic_adapter.py:680) on an anthropic-only path this diff does not touch. It fails on baseline too — not a regression.

Coverage

  • Resolver: matching / non-matching base_url, missing model, numeric-string coercion, zero/negative, non-numeric, bool rejection, trailing-slash (both directions), empty inputs.
  • Precedence / config shape: per-model lands, per-model beats global, global fallback, explicit wins, bare-string model: config, both invalid-value warning paths.
  • Lifecycle (all mutation sites): switch (re-resolve + drop-global + explicit-survive + failed-swap rollback), fallback (re-resolve + drop-global + explicit-survive), restore (reapply + explicit-preserve).
  • Tests assert behavior not implementation; zero time/random/flaky dependencies. Switch tests mock the resolver (seam isolation); fallback tests use the real resolver (integration) — deliberately complementary.

Known behaviors (disclosed)

  • Two deliberate asymmetries cloned verbatim from the existing _config_context_length machinery: switch_model re-resolves from live config but does not write back agent._custom_providers, so a switch→fallback uses the startup snapshot (fix-both-or-neither, intentionally not diverging here); the fallback path has no try/except rollback (the same pre-existing exposure already present for model/provider/_config_context_length, not new surface).
  • Cosmetic: the warning-only _raw_per_model_max_tokens returns on the first base_url match rather than full-scanning, so a pathological duplicate-base_url config could miss (never mis-emit) an invalid-value warning. Low severity, untested.
  • _primary_runtime is never serialized (pure in-memory, rebuilt by init_agent on every start), so the rt.get → None tolerance on restore is defensive only and never bites a real upgraded session.

Issue / PR linkage

Closes #15037, #20004, #28046, #28782
Addresses #20741
Supersedes #28142, #28154, #28786; supersedes the max_tokens slice of #28995

Not in scope (tracked separately)

Sibling of get_custom_provider_context_length: reads per-model
custom_providers[].models.<model>.max_tokens (positive int, bool-rejecting,
trailing-slash-insensitive base_url match). No call sites yet. Refs NousResearch#28046.
Resolve agent.max_tokens with precedence explicit > per-model
custom_providers[].models.<m>.max_tokens > global model.max_tokens. Per-model
wins over global (specificity); context_length is unchanged. Persist
_config_max_tokens / _max_tokens_explicit for switch/fallback re-resolution.
Refs NousResearch#28046, NousResearch#15037, NousResearch#28782, NousResearch#20741.
Null-and-re-resolve the config-derived max_tokens against the new model on
/model switch and fallback activation, mirroring _config_context_length, and
add it to the switch rollback snapshot. The global model.max_tokens applies to
the primary model only and no longer leaks onto fallbacks; explicit
constructor values survive switches. Also harden _resolve_max_tokens against a
bare-string model: config (surfaced by the integration tests). Refs NousResearch#28782, NousResearch#15037.
Carry max_tokens / _config_max_tokens in _primary_runtime and reapply them in
restore_primary_runtime so a fallback's re-resolved cap doesn't persist onto
the recovered primary model. Completes the no-leak guarantee end-to-end.
Closes NousResearch#15037, NousResearch#28046, NousResearch#28782. Addresses NousResearch#20741, NousResearch#20004.
@alt-glitch alt-glitch added type/feature New feature or request P2 Medium — degraded but workaround exists comp/agent Core agent loop, run_agent.py, prompt builder comp/cli CLI entry point, hermes_cli/, setup wizard area/config Config system, migrations, profiles labels May 30, 2026
@alt-glitch

Copy link
Copy Markdown
Collaborator

Superset of the existing max_tokens cluster: #20149, #28142, #28154, #28786. Adds switch_model/fallback re-resolution and bool rejection that earlier PRs lack. Related issues: #15037, #20004, #20741, #28046, #28782. Also overlaps with #29705 (per-model max_tokens overlay for built-in providers).

@tonydwb tonydwb left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review: feat(custom_providers): per-model max_tokens with switch/fallback re-resolution

Author: banditburai
Branch: worktree-fix+custom-provider-max-tokens → main
Files changed: 7 files (+490 / -26)
headRefOid: a22369c

Overview

This PR propagates a per-model max_tokens output cap from custom_providers[].models.<model>.max_tokens to agent.max_tokens, fixing a long-standing bug where a configured max_tokens never reached the request wire, causing silent truncation on Ollama Cloud, zai, and OpenAI-compatible proxies.

The solution mirrors the existing context_length machinery with careful re-resolution across three mid-session seams (switch_model, try_activate_fallback, restore_primary_runtime), plus leak guarantees in both directions (#28782).

Correctness (9/10)

  • Precedence is clearly documented and correct: explicit > per-model custom_providers > global model.max_tokens > None. The deliberate inversion (per-model checked before global, opposite of context_length) makes semantic sense — a provider-specific output cap should win.
  • Re-resolution at all three seams (switch, fallback, restore) faithfully mirrors the _config_context_length pattern.
  • Leak guarantees are implemented symmetrically: the global is nulled on switch/fallback (forward), and restore_primary_runtime reapplies the primary's snapshot (backward). Explicit values are gated by _max_tokens_explicit at every mutation site.

Minor concern: The _raw_per_model_max_tokens() fallback warning-path function in _resolve_max_tokens scans for the first matching base_url and returns early (return inside a for+continue block). If a user has duplicate base_url entries in custom_providers, an invalid value on the second duplicate would never produce a warning. The PR doc explicitly acknowledges this as low-severity and untested. Acceptable for v1, but worth flagging.

Security (10/10)

  • No new network exposure, shell execution paths, or unsafe deserialization.
  • base_url matching is trailing-slash-insensitive (safe normalization).
  • bool is properly rejected (it's an int subclass but never a valid cap — prevents True1 token truncation).
  • int() coercion with explicit TypeError/ValueError handling — no silent overflow or injection risk.
  • No changes to authentication or credential handling.

Code Quality (9/10)

  • Excellent docstrings — every new function has a clear docstring describing purpose, precedence, and behavior.
  • Inline comments are thorough and explanatory, not redundant.
  • Consistent naming: _resolve_max_tokens parallels _config_context_length; get_custom_provider_max_tokens parallels get_custom_provider_context_length.
  • Defensive programming: getattr(agent, "_custom_providers", None) or [], rt.get() tolerance on restore, _ra().logger.warning + print(stderr) dual warning (matching existing pattern).
  • Diff is minimal and focused — exactly what's needed, no scope creep.

Nitpicks:

  • _raw_per_model_max_tokens() is defined as a nested function inside _resolve_max_tokens() with a very similar purpose to get_custom_provider_max_tokens(). The duplication is intentional (one is the validation/warning path, the other is the clean resolver), but it would be cleaner to share the lookup logic. The PR doc justifies this: the warning-only path returns on first match rather than full-scanning, which is a slightly different contract. Still, a shared _find_custom_provider_model_cfg(base_url, model, custom_providers) helper could reduce the code duplication.
  • The except Exception catch-alls on the resolver calls (e.g., try: from hermes_cli.config import get_custom_provider_max_tokens ... except Exception: cp_mt = None) are a bit broad. These are lazy imports and config reads — ImportError and KeyError/TypeError are expected; KeyboardInterrupt or SystemExit would be swallowed. For a cron/code-review context this is fine (existing code does the same pattern), but ideally catch specific exceptions.

Testing (10/10)

  • 138 new tests across 3 test files — excellent coverage.
  • Unit tests for get_custom_provider_max_tokens(): matching/non-matching base_url, missing model, numeric-string coercion, zero/negative, non-numeric, bool rejection, trailing-slash (both directions), empty inputs. All parameterized where appropriate.
  • Precedence tests for _resolve_max_tokens(): per-model lands, per-model beats global, global fallback, explicit wins, invalid values warn, bare-string model config tolerated.
  • Lifecycle tests (switch, fallback, restore): explicit survive, global dropped, per-model re-resolved, rollback on failed switch. Comprehensive.
  • Clean assertions — tests assert behavior, not implementation. No time/random/flaky dependencies.
  • The PR author notes 138/138 tests pass on these files, and the one unrelated failure is an environmental anthropic import error that also fails on baseline.

Performance (10/10)

  • Config read is at init time only — no per-request overhead.
  • Re-resolution on switch/fallback involves load_config() (cached by Python YAML loader) plus a simple linear scan of custom_providers — typically <1ms.
  • _primary_runtime snapshot stores max_tokens as a plain Python int/none field — negligible memory impact on a dataclass that already stores 20+ fields.

Summary

Category Score
Correctness 9/10
Security 10/10
Code Quality 9/10
Testing 10/10
Performance 10/10

Decision: APPROVE (minor suggestions only, no blockers)

Suggested Follow-ups (not blocking)

  1. Consider extracting a shared _find_custom_provider_model_cfg(base_url, model, custom_providers) helper to eliminate the structural duplication between get_custom_provider_max_tokens() and _raw_per_model_max_tokens() / the context_length warning block.
  2. Narrow the except Exception wrappers to (ImportError, TypeError, KeyError) where feasible.
  3. Track the base_url duplicate warning gap as a follow-up issue for completeness.

Closes #15037, #20004, #28046, #28782. Well-structured, well-tested, minimal blast radius.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/config Config system, migrations, profiles comp/agent Core agent loop, run_agent.py, prompt builder comp/cli CLI entry point, hermes_cli/, setup wizard P2 Medium — degraded but workaround exists type/feature New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat(config): support per-model max_tokens in custom_providers config

3 participants