feat(custom_providers): per-model max_tokens with switch/fallback re-resolution#35518
feat(custom_providers): per-model max_tokens with switch/fallback re-resolution#35518banditburai wants to merge 4 commits into
Conversation
Sibling of get_custom_provider_context_length: reads per-model custom_providers[].models.<model>.max_tokens (positive int, bool-rejecting, trailing-slash-insensitive base_url match). No call sites yet. Refs NousResearch#28046.
Resolve agent.max_tokens with precedence explicit > per-model custom_providers[].models.<m>.max_tokens > global model.max_tokens. Per-model wins over global (specificity); context_length is unchanged. Persist _config_max_tokens / _max_tokens_explicit for switch/fallback re-resolution. Refs NousResearch#28046, NousResearch#15037, NousResearch#28782, NousResearch#20741.
Null-and-re-resolve the config-derived max_tokens against the new model on /model switch and fallback activation, mirroring _config_context_length, and add it to the switch rollback snapshot. The global model.max_tokens applies to the primary model only and no longer leaks onto fallbacks; explicit constructor values survive switches. Also harden _resolve_max_tokens against a bare-string model: config (surfaced by the integration tests). Refs NousResearch#28782, NousResearch#15037.
Carry max_tokens / _config_max_tokens in _primary_runtime and reapply them in restore_primary_runtime so a fallback's re-resolved cap doesn't persist onto the recovered primary model. Completes the no-leak guarantee end-to-end. Closes NousResearch#15037, NousResearch#28046, NousResearch#28782. Addresses NousResearch#20741, NousResearch#20004.
tonydwb
left a comment
There was a problem hiding this comment.
Code Review: feat(custom_providers): per-model max_tokens with switch/fallback re-resolution
Author: banditburai
Branch: worktree-fix+custom-provider-max-tokens → main
Files changed: 7 files (+490 / -26)
headRefOid: a22369c
Overview
This PR propagates a per-model max_tokens output cap from custom_providers[].models.<model>.max_tokens to agent.max_tokens, fixing a long-standing bug where a configured max_tokens never reached the request wire, causing silent truncation on Ollama Cloud, zai, and OpenAI-compatible proxies.
The solution mirrors the existing context_length machinery with careful re-resolution across three mid-session seams (switch_model, try_activate_fallback, restore_primary_runtime), plus leak guarantees in both directions (#28782).
Correctness (9/10)
- Precedence is clearly documented and correct: explicit > per-model custom_providers > global model.max_tokens > None. The deliberate inversion (per-model checked before global, opposite of context_length) makes semantic sense — a provider-specific output cap should win.
- Re-resolution at all three seams (switch, fallback, restore) faithfully mirrors the
_config_context_lengthpattern. - Leak guarantees are implemented symmetrically: the global is nulled on switch/fallback (forward), and restore_primary_runtime reapplies the primary's snapshot (backward). Explicit values are gated by
_max_tokens_explicitat every mutation site.
Minor concern: The _raw_per_model_max_tokens() fallback warning-path function in _resolve_max_tokens scans for the first matching base_url and returns early (return inside a for+continue block). If a user has duplicate base_url entries in custom_providers, an invalid value on the second duplicate would never produce a warning. The PR doc explicitly acknowledges this as low-severity and untested. Acceptable for v1, but worth flagging.
Security (10/10)
- No new network exposure, shell execution paths, or unsafe deserialization.
base_urlmatching is trailing-slash-insensitive (safe normalization).boolis properly rejected (it's anintsubclass but never a valid cap — preventsTrue→1token truncation).int()coercion with explicitTypeError/ValueErrorhandling — no silent overflow or injection risk.- No changes to authentication or credential handling.
Code Quality (9/10)
- Excellent docstrings — every new function has a clear docstring describing purpose, precedence, and behavior.
- Inline comments are thorough and explanatory, not redundant.
- Consistent naming:
_resolve_max_tokensparallels_config_context_length;get_custom_provider_max_tokensparallelsget_custom_provider_context_length. - Defensive programming:
getattr(agent, "_custom_providers", None) or [],rt.get()tolerance on restore,_ra().logger.warning+print(stderr)dual warning (matching existing pattern). - Diff is minimal and focused — exactly what's needed, no scope creep.
Nitpicks:
_raw_per_model_max_tokens()is defined as a nested function inside_resolve_max_tokens()with a very similar purpose toget_custom_provider_max_tokens(). The duplication is intentional (one is the validation/warning path, the other is the clean resolver), but it would be cleaner to share the lookup logic. The PR doc justifies this: the warning-only path returns on first match rather than full-scanning, which is a slightly different contract. Still, a shared_find_custom_provider_model_cfg(base_url, model, custom_providers)helper could reduce the code duplication.- The
except Exceptioncatch-alls on the resolver calls (e.g.,try: from hermes_cli.config import get_custom_provider_max_tokens ... except Exception: cp_mt = None) are a bit broad. These are lazy imports and config reads —ImportErrorandKeyError/TypeErrorare expected;KeyboardInterruptorSystemExitwould be swallowed. For a cron/code-review context this is fine (existing code does the same pattern), but ideally catch specific exceptions.
Testing (10/10)
- 138 new tests across 3 test files — excellent coverage.
- Unit tests for
get_custom_provider_max_tokens(): matching/non-matching base_url, missing model, numeric-string coercion, zero/negative, non-numeric,boolrejection, trailing-slash (both directions), empty inputs. All parameterized where appropriate. - Precedence tests for
_resolve_max_tokens(): per-model lands, per-model beats global, global fallback, explicit wins, invalid values warn, bare-string model config tolerated. - Lifecycle tests (switch, fallback, restore): explicit survive, global dropped, per-model re-resolved, rollback on failed switch. Comprehensive.
- Clean assertions — tests assert behavior, not implementation. No time/random/flaky dependencies.
- The PR author notes 138/138 tests pass on these files, and the one unrelated failure is an environmental anthropic import error that also fails on baseline.
Performance (10/10)
- Config read is at init time only — no per-request overhead.
- Re-resolution on switch/fallback involves
load_config()(cached by Python YAML loader) plus a simple linear scan ofcustom_providers— typically <1ms. _primary_runtimesnapshot storesmax_tokensas a plain Python int/none field — negligible memory impact on a dataclass that already stores 20+ fields.
Summary
| Category | Score |
|---|---|
| Correctness | 9/10 |
| Security | 10/10 |
| Code Quality | 9/10 |
| Testing | 10/10 |
| Performance | 10/10 |
Decision: APPROVE (minor suggestions only, no blockers)
Suggested Follow-ups (not blocking)
- Consider extracting a shared
_find_custom_provider_model_cfg(base_url, model, custom_providers)helper to eliminate the structural duplication betweenget_custom_provider_max_tokens()and_raw_per_model_max_tokens()/ the context_length warning block. - Narrow the
except Exceptionwrappers to(ImportError, TypeError, KeyError)where feasible. - Track the
base_urlduplicate warning gap as a follow-up issue for completeness.
Closes #15037, #20004, #28046, #28782. Well-structured, well-tested, minimal blast radius.
Honor
custom_providers[].models.<model>.max_tokensend-to-end so a per-model output cap is never silently dropped (truncation) and never leaks onto another model.Problem
A configured
max_tokenshad no effect — output silently truncated on Ollama Cloud / zai / OpenAI-compatible proxies. This is not a hard 4096 floor: the only literal4096fallback on the request path is on the bedrock call (agent.max_tokens or 4096atchat_completion_helpers.py:563, consumed bytransports/bedrock.py:58). The real symptom is that the field was never set, so each provider fell back to its own small default ceiling.Root cause
A propagation gap: the CLI never passed
max_tokens=into theAIAgentconstructor, soagent.max_tokensstayedNone. WithNone, providers applied their own default → truncation. Omittingmax_tokenswhen nothing is configured is the desired behavior — the model's native output ceiling is used (anthropic_adapter.py:2072); the bug was that a configured value never reachedagent.max_tokens. The fix mirrors the existingcontext_lengthresolve/propagate machinery.Solution (4 core files)
get_custom_provider_max_tokens(hermes_cli/config.py:3662), sibling ofget_custom_provider_context_length: trailing-slash-insensitivebase_urlmatch, positive-int only,boolrejected (it is anintsubclass but never a valid cap)._resolve_max_tokens(agent/agent_init.py:136), called atagent_init.py:1452after_custom_providersis resolved (:1394) so the per-model value can win. The old inline global-only block (~:1352) is deleted and subsumed with identical int/bool/<=0validation and warning text — no behavior loss on the global path. Persists_config_max_tokensand_max_tokens_explicit; both keys are seeded in the_primary_runtimebase literal (:1700).agent.max_tokensis the single field consumed across every transport: chat_completionstransports/chat_completions.py:298,486(guardedis not None), anthropicanthropic_adapter.py:2168, bedrocktransports/bedrock.py:58, direct-OpenAI/o1 mapped tomax_completion_tokensrun_agent.py:1206.Precedence
Precedence inversion (deliberate): per-model is checked before the global value — the opposite of
context_length(global-first). A provider-specific output cap is the more-specific intent and must win. Blast radius is contained:context_lengthprecedence is untouched, and_KNOWN_KEYSis unchanged (an entry-levelmax_tokensstill fires the existing "unknown config keys ignored" warning; the supported location ismodels.<model>.max_tokens, nested-only by design).Initial resolve + 3 re-resolution seams
The config-resolved cap is null-and-re-resolved against the active model at each transition, faithfully mirroring
_config_context_length:load_config(); the global never re-applies (no leak). Snapshotted before the try (agent_runtime_helpers.py:1409-1410), nulled inside the try (:1421), re-resolved (:1532-1547), and rolled back on a failed client build (:1505-1513).agent._custom_providers(chat_completion_helpers.py:1138,1148-1158).rt.getreapply of the primary snapshot (agent_runtime_helpers.py:957-959).Leak guarantees (#28782 — both directions)
restore_primary_runtimereapplies the primary's snapshot._max_tokens_explicitat every mutation site (switch null + re-resolve, fallback null + re-resolve, restore reapply) — a user's--max_tokensis never clobbered.Testing
A broader sweep (
test_primary_runtime_restore.py,test_switch_model_context.py+ the two new files) is 42 passed / 1 failed; the one failure istest_primary_runtime_restore.py::TestTryRecoverPrimaryTransport::test_allowed_for_anthropic_direct— an environmentalanthropiclazy-installImportError(anthropic_adapter.py:680) on an anthropic-only path this diff does not touch. It fails on baseline too — not a regression.Coverage
base_url, missing model, numeric-string coercion, zero/negative, non-numeric,boolrejection, trailing-slash (both directions), empty inputs.model:config, both invalid-value warning paths.Known behaviors (disclosed)
_config_context_lengthmachinery: switch_model re-resolves from live config but does not write backagent._custom_providers, so a switch→fallback uses the startup snapshot (fix-both-or-neither, intentionally not diverging here); the fallback path has no try/except rollback (the same pre-existing exposure already present formodel/provider/_config_context_length, not new surface)._raw_per_model_max_tokensreturns on the firstbase_urlmatch rather than full-scanning, so a pathological duplicate-base_urlconfig could miss (never mis-emit) an invalid-value warning. Low severity, untested._primary_runtimeis never serialized (pure in-memory, rebuilt byinit_agenton every start), so thert.get → Nonetolerance on restore is defensive only and never bites a real upgraded session.Issue / PR linkage
Closes #15037, #20004, #28046, #28782
Addresses #20741
Supersedes #28142, #28154, #28786; supersedes the max_tokens slice of #28995
custom_providersmax_tokensnow reachesAIAgent— the exact net-new behavior of this diff), Bug: max_tokens not read from custom_providers per-model config, always defaults to 4096 #28046 (the always-4096 default is verifiably gone), feat: custom_providers should support per-provider max_tokens override #28782 (per-model override incustom_providers+ both leak directions fully owned).max_tokensfrom config.yaml is silently ignored — never propagated to AIAgent, causing output truncation on Ollama Cloud / zai / custom endpoints #20741: its stated root cause is the globalconfig.yamlmodel.max_tokenspath, which this diff preserves (relocates) rather than introduces — that path already propagated on baseline. The diff does fix the custom-endpoint truncation it reports (via the per-model read), but does not exclusively own the issue's global-config framing, so closure is left conservative.boolrejection and switch/fallback/restore re-resolution none of them had); only the max_tokens slice of feat: Config-Runtime Contract Registry (Phase 1) + fix #28046 + fix #28863 #28995 is superseded.Not in scope (tracked separately)
terminal.docker_extra_argsmissing fromgateway/run.py_terminal_env_map(a one-line gateway fix feat: Config-Runtime Contract Registry (Phase 1) + fix #28046 + fix #28863 #28995 happened to bundle). Not included here; still open; will be filed as a separate follow-up. This PR does not claimCloses #28863.