fix(gateway): propagate credential_pool through /model session overrides (#16678)#16701
fix(gateway): propagate credential_pool through /model session overrides (#16678)#16701briandevans wants to merge 3 commits into
Conversation
…des (NousResearch#16678) When a gateway session uses /model to switch to a different provider, the gateway stored the new provider's model/api_key/base_url/api_mode in `_session_model_overrides[session_key]` but dropped the credential pool returned by `resolve_runtime_provider`. On the next turn, `_apply_session_model_override` updated runtime_kwargs for everything *except* `credential_pool`, so the original provider's pool stayed live. Result: a 429 on the new provider rotated within the *old* provider's pool (whose credentials don't apply to the new endpoint), or — when the old runtime had no pool — skipped rotation entirely and fell through to the configured fallback model instead of rotating to the next credential on the new provider. Fix: - Add `credential_pool` to `ModelSwitchResult` and capture it in `switch_model()` from the runtime resolution on both the explicit- provider path and the same-provider re-resolve path. - Store `credential_pool` on `_session_model_overrides[session_key]` in both /model handler paths (regular switch + picker callback). - `_apply_session_model_override` now also propagates `credential_pool`, with one specific deviation: an explicit `None` overrides the prior pool instead of being skipped, because a switched-to provider may legitimately have no pool and leaving the old one in place would rotate against the wrong provider's credentials. Test plan: - `tests/hermes_cli/test_model_switch_credential_pool.py` — pins that `ModelSwitchResult.credential_pool` carries the resolved pool on explicit-provider switches, same-provider re-resolves, and the no-pool case. - `tests/gateway/test_session_model_override_routing.py` — three new cases verify `_apply_session_model_override` propagates the override pool, replaces an existing pool with `None` when explicitly cleared, and keeps the existing runtime pool when the override predates this field (legacy shape). - Regression-guarded: temporarily reverted the production change and confirmed all 5 new assertions fail with the precise expected AttributeError / pool-identity mismatch on clean origin/main, then restored. - Adjacent: full `tests/hermes_cli/test_model_switch_*.py`, `tests/run_agent/test_switch_model_*.py`, and gateway session/model tests pass (101 + 18). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Fixes a gateway /model provider-switch bug where the target provider’s credential_pool was not propagated, causing 429 recovery to rotate/behave against the previous provider’s credentials.
Changes:
- Extend
ModelSwitchResultto includecredential_pooland populate it fromresolve_runtime_provider()inswitch_model(). - Store
credential_poolin gateway session model overrides for both/modelswitch paths. - Update
_apply_session_model_overrideto propagatecredential_pool, including explicitly overriding withNonewhen provided, plus add regression tests for these cases.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
hermes_cli/model_switch.py |
Adds credential_pool to the model switch result and threads it through runtime resolution paths. |
gateway/run.py |
Persists credential_pool in session overrides and applies it to subsequent runtime kwargs (with explicit-None semantics). |
tests/hermes_cli/test_model_switch_credential_pool.py |
New regression tests asserting credential_pool round-trips through switch_model() across key paths. |
tests/gateway/test_session_model_override_routing.py |
Adds regression tests ensuring session override application propagates/clears/preserves credential_pool correctly. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # CredentialPool for the resolved target provider, when one was loaded. | ||
| # Callers that maintain runtime kwargs (e.g. the gateway session-override | ||
| # store) must propagate this so per-provider rotation continues to work | ||
| # after a /model switch — without it, a 429 on the new provider rotates | ||
| # within the original provider's pool (or skips rotation entirely and | ||
| # falls through to the configured fallback model). See #16678. | ||
| credential_pool: Optional[Any] = None |
There was a problem hiding this comment.
ModelSwitchResult.credential_pool is typed as Optional[Any], but the runtime provider resolver returns a concrete CredentialPool (see hermes_cli/runtime_provider.py importing CredentialPool). Using Any loses type-safety and makes it harder for callers to know what’s expected. Consider typing this as Optional[CredentialPool] (or Optional["CredentialPool"] with a TYPE_CHECKING import) to document the contract and catch misuse.
| # The new provider's credential pool — without | ||
| # this, future-turn rotation on 429 stays bound | ||
| # to the original provider's pool. See #16678. | ||
| "credential_pool": result.credential_pool, | ||
| } |
There was a problem hiding this comment.
These overrides now store credential_pool, which is not a str. However, _session_model_overrides is annotated and documented as Dict[str, Dict[str, str]] (and described as holding only model/provider/api_key/base_url/api_mode). Please update the type annotation (and the comment describing the override shape) to allow credential_pool (e.g., Dict[str, Dict[str, Any]] or a TypedDict with credential_pool: CredentialPool | None). This avoids misleading types and helps static analysis.
Address Copilot review on PR NousResearch#16701: 1. **`ModelSwitchResult.credential_pool` typed as `Optional[Any]`** — the runtime provider resolver returns a concrete `CredentialPool` instance, so the dataclass field is now typed as `Optional["CredentialPool"]` via a `TYPE_CHECKING`-guarded import from `agent.credential_pool`. Keeps the runtime import graph asymmetric (agent.* depends on hermes_cli.*, not the reverse) while surfacing the contract to static analysis. 2. **`_session_model_overrides: Dict[str, Dict[str, str]]`** — the dict now also stores a live `CredentialPool` under the `credential_pool` key, which is not a `str`. Widen the value type to `Dict[str, Any]` (both the class-level annotation and the instance-level assignment in `__init__`) and update the comment to document that `credential_pool` rides alongside the str fields. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
@copilot Both findings addressed in commit Finding 1 ( Finding 2 ( Verified locally: 20/20 tests in |
|
Nice work on this fix! Your approach of adding I independently hit the same bug and submitted PR #19064 before realizing this PR existed. Our fix takes a different approach (resolving the pool from the provider name at each consumer point), but yours is architecturally cleaner. One gap I noticed: this PR only fixes the gateway path (
Both call Would you be willing to add the CLI fix as well? The pattern is the same — resolve the pool from Happy to close #19064 if you'd prefer to own the full fix. Either way, this PR should go first — it's the proper foundation. |
When a gateway or CLI session uses /model to switch providers, the new provider's credential_pool was silently dropped — ModelSwitchResult had no such field, so switch_model() discarded what resolve_runtime_provider returned. The agent was created with credential_pool=None, making _recover_with_credential_pool() a no-op on 429/402 errors. Fix (combining the approach from NousResearch#16701 with CLI coverage): - hermes_cli/model_switch.py: Add credential_pool field to ModelSwitchResult, capture it from resolve_runtime_provider() in both the explicit-provider and same-provider re-resolve paths. - gateway/run.py: Propagate credential_pool from the result into session overrides and in-place agent.switch_model(). Update _session_model_overrides type hint from Dict[str, str] to Dict[str, Any] to accommodate the CredentialPool instance. - cli.py: Both _apply_model_switch_result and _handle_model_switch pass result.credential_pool to agent.switch_model() and update self._credential_pool. - run_agent.py: Accept credential_pool parameter in switch_model() and update self._credential_pool when provided. Based on the analysis in NousResearch#16701 (briandevans). Closes NousResearch#16678.
|
Updated #19064 to adopt your approach from this PR. What I took from yours:
What I added on top:
The result is now 4 files, +23/-48 net lines — cleaner than both our original PRs since all pool resolution happens once in |
…gent.switch_model Widens the credential_pool fix to cover the two CLI /model paths and the AIAgent in-place swap, completing the propagation chain that began with the gateway-only fix in d768cf0. Three sites still dropped the resolved pool when /model was invoked outside the gateway: * `cli.py::_apply_model_switch_result` — model picker → switch_model * `cli.py::_handle_model_switch` — `/model <name>` command → switch_model * `run_agent.py::AIAgent.switch_model` — accepted no `credential_pool` kwarg, so even when callers had the pool there was no way to apply it to the live agent's `_credential_pool` After this change the full chain is covered: ModelSwitchResult carries the pool out of the resolver, both CLI call sites mirror it onto ``self._credential_pool`` and forward it into ``agent.switch_model()``, which applies it under the same ``if credential_pool is not None`` guard the gateway uses. A switch from one pooled provider to another now keeps 429/402 rotation working on the new provider; a switch to a single-key provider preserves the existing pool rather than silently clearing it. Test plan - new `tests/run_agent/test_switch_model_credential_pool.py` (3 cases): swap pool, omit kwarg keeps existing, explicit None preserves existing - existing `tests/hermes_cli/test_model_switch_credential_pool.py` (3) and `tests/gateway/test_session_model_override_routing.py` (5) pass against the wider propagation - regression guard: removing the `if credential_pool is not None` block in run_agent.switch_model fails the swap-pool test; dropping the kwarg from the agent signature fails the call-site at TypeError Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
@Cyrene963 thanks for the close read and the gap analysis — you were right that the gateway-only fix was incomplete. Pushed
New regression file Happy for you to close #19064 since this PR now covers the same surface — much appreciated for the heads-up either way. |
When a gateway or CLI session uses /model to switch providers, the new provider's credential_pool was silently dropped — ModelSwitchResult had no such field, so switch_model() discarded what resolve_runtime_provider returned. The agent was created with credential_pool=None, making _recover_with_credential_pool() a no-op on 429/402 errors. Fix (combining the approach from NousResearch#16701 with CLI coverage): - hermes_cli/model_switch.py: Add credential_pool field to ModelSwitchResult, capture it from resolve_runtime_provider() in both the explicit-provider and same-provider re-resolve paths. - gateway/run.py: Propagate credential_pool from the result into session overrides and in-place agent.switch_model(). Update _session_model_overrides type hint from Dict[str, str] to Dict[str, Any] to accommodate the CredentialPool instance. - cli.py: Both _apply_model_switch_result and _handle_model_switch pass result.credential_pool to agent.switch_model() and update self._credential_pool. - run_agent.py: Accept credential_pool parameter in switch_model() and update self._credential_pool when provided. Based on the analysis in NousResearch#16701 (briandevans). Closes NousResearch#16678.
|
Closing to keep the queue clean — 15 days idle and now conflicting across cli.py / run_agent.py / gateway/run.py. Happy to reopen if this is still useful. |
Summary
When a gateway session uses
/modelto switch to a different provider, the new provider'scredential_poolwas dropped — leaving the original provider's pool live across the switch. A 429 on the new provider then either rotated against the old provider's credentials (which don't apply to the new endpoint) or skipped pool rotation entirely and fell through to the configured fallback model.Fixes #16678.
The bug
hermes_cli.runtime_provider.resolve_runtime_provider()returnscredential_poolalongsideapi_key/base_url/api_modefor the resolved provider. Three layers conspired to drop it on/modelswitches:ModelSwitchResulthad nocredential_poolfield, soswitch_model()discarded whatresolve_runtime_providerreturned for the target provider./modelhandlers (regular switch atgateway/run.py:6116, picker callback atgateway/run.py:5988) stored model/provider/api_key/base_url/api_mode on_session_model_overrides[session_key]— but not the pool._apply_session_model_overridepropagated everything exceptcredential_poolintoruntime_kwargs, so on the next turn the runtime still carried the startup-time pool from the original provider.Net effect:
agent._credential_pool(the rotation source for_recover_with_credential_poolatrun_agent.py:5761) belonged to the old provider after a/modelswitch.The fix
hermes_cli/model_switch.py— addcredential_pooltoModelSwitchResult, capture it from the runtime inswitch_model()on both the explicit-provider path and the same-provider re-resolve path.gateway/run.py— storecredential_poolon_session_model_overrides[session_key]in both/modelhandler paths.gateway/run.py—_apply_session_model_overridenow also propagatescredential_pool, with one deviation from the existing skip-if-Nonerule: an explicitNoneoverrides the prior pool instead of being skipped. A switched-to provider may legitimately have no pool (single env-var key, custom local endpoint), and leaving the old pool in place would rotate against the wrong provider's credentials.The legacy override shape (no
credential_poolkey at all) is preserved by theif \"credential_pool\" in override:gate, so any pre-fix override sitting in memory keeps the existing runtime pool rather than getting silently blanked.Test plan
tests/hermes_cli/test_model_switch_credential_pool.py—ModelSwitchResult.credential_poolround-trips through explicit-provider switches, same-provider re-resolves, and the no-pool case.tests/gateway/test_session_model_override_routing.py— three new cases verify_apply_session_model_overridepropagates the override pool, replaces an existing pool withNonewhen explicitly cleared, and keeps the existing runtime pool when the override predates this field.origin/mainwith the precise expectedAttributeError: 'ModelSwitchResult' object has no attribute 'credential_pool'/ pool-identity mismatch, then restored.tests/hermes_cli/test_model_switch_*.py,tests/run_agent/test_switch_model_*.py, and gateway session/model tests pass (101 + 18, all green).Related
Fixes #16678.
Adjacent prior work on the same surface, none of which addressed this gap:
FailoverReason.overloadedcustom_providerscontext_lengthon/modelswitch