Skip to content

fix(gateway): propagate credential_pool through /model session overrides (#16678)#16701

Closed
briandevans wants to merge 3 commits into
NousResearch:mainfrom
briandevans:fix/credential-pool-model-switch-16678
Closed

fix(gateway): propagate credential_pool through /model session overrides (#16678)#16701
briandevans wants to merge 3 commits into
NousResearch:mainfrom
briandevans:fix/credential-pool-model-switch-16678

Conversation

@briandevans

Copy link
Copy Markdown
Contributor

Summary

When a gateway session uses /model to switch to a different provider, the new provider's credential_pool was dropped — leaving the original provider's pool live across the switch. A 429 on the new provider then either rotated against the old provider's credentials (which don't apply to the new endpoint) or skipped pool rotation entirely and fell through to the configured fallback model.

Fixes #16678.

The bug

hermes_cli.runtime_provider.resolve_runtime_provider() returns credential_pool alongside api_key / base_url / api_mode for the resolved provider. Three layers conspired to drop it on /model switches:

  1. ModelSwitchResult had no credential_pool field, so switch_model() discarded what resolve_runtime_provider returned for the target provider.
  2. The gateway's two /model handlers (regular switch at gateway/run.py:6116, picker callback at gateway/run.py:5988) stored model/provider/api_key/base_url/api_mode on _session_model_overrides[session_key] — but not the pool.
  3. _apply_session_model_override propagated everything except credential_pool into runtime_kwargs, so on the next turn the runtime still carried the startup-time pool from the original provider.

Net effect: agent._credential_pool (the rotation source for _recover_with_credential_pool at run_agent.py:5761) belonged to the old provider after a /model switch.

The fix

  • hermes_cli/model_switch.py — add credential_pool to ModelSwitchResult, capture it from the runtime in switch_model() on both the explicit-provider path and the same-provider re-resolve path.
  • gateway/run.py — store credential_pool on _session_model_overrides[session_key] in both /model handler paths.
  • gateway/run.py_apply_session_model_override now also propagates credential_pool, with one deviation from the existing skip-if-None rule: an explicit None overrides the prior pool instead of being skipped. A switched-to provider may legitimately have no pool (single env-var key, custom local endpoint), and leaving the old pool in place would rotate against the wrong provider's credentials.

The legacy override shape (no credential_pool key at all) is preserved by the if \"credential_pool\" in override: gate, so any pre-fix override sitting in memory keeps the existing runtime pool rather than getting silently blanked.

Test plan

  • tests/hermes_cli/test_model_switch_credential_pool.pyModelSwitchResult.credential_pool round-trips through explicit-provider switches, same-provider re-resolves, and the no-pool case.
  • tests/gateway/test_session_model_override_routing.py — three new cases verify _apply_session_model_override propagates the override pool, replaces an existing pool with None when explicitly cleared, and keeps the existing runtime pool when the override predates this field.
  • Regression guard: temporarily reverted the production change and confirmed all 5 new assertions fail on clean origin/main with the precise expected AttributeError: 'ModelSwitchResult' object has no attribute 'credential_pool' / pool-identity mismatch, then restored.
  • Adjacent suites: full tests/hermes_cli/test_model_switch_*.py, tests/run_agent/test_switch_model_*.py, and gateway session/model tests pass (101 + 18, all green).

Related

Fixes #16678.

Adjacent prior work on the same surface, none of which addressed this gap:

…des (NousResearch#16678)

When a gateway session uses /model to switch to a different provider,
the gateway stored the new provider's model/api_key/base_url/api_mode
in `_session_model_overrides[session_key]` but dropped the credential
pool returned by `resolve_runtime_provider`.  On the next turn,
`_apply_session_model_override` updated runtime_kwargs for everything
*except* `credential_pool`, so the original provider's pool stayed live.

Result: a 429 on the new provider rotated within the *old* provider's
pool (whose credentials don't apply to the new endpoint), or — when
the old runtime had no pool — skipped rotation entirely and fell
through to the configured fallback model instead of rotating to the
next credential on the new provider.

Fix:

- Add `credential_pool` to `ModelSwitchResult` and capture it in
  `switch_model()` from the runtime resolution on both the explicit-
  provider path and the same-provider re-resolve path.
- Store `credential_pool` on `_session_model_overrides[session_key]`
  in both /model handler paths (regular switch + picker callback).
- `_apply_session_model_override` now also propagates `credential_pool`,
  with one specific deviation: an explicit `None` overrides the prior
  pool instead of being skipped, because a switched-to provider may
  legitimately have no pool and leaving the old one in place would
  rotate against the wrong provider's credentials.

Test plan:

- `tests/hermes_cli/test_model_switch_credential_pool.py` — pins that
  `ModelSwitchResult.credential_pool` carries the resolved pool on
  explicit-provider switches, same-provider re-resolves, and the
  no-pool case.
- `tests/gateway/test_session_model_override_routing.py` — three new
  cases verify `_apply_session_model_override` propagates the override
  pool, replaces an existing pool with `None` when explicitly cleared,
  and keeps the existing runtime pool when the override predates this
  field (legacy shape).
- Regression-guarded: temporarily reverted the production change and
  confirmed all 5 new assertions fail with the precise expected
  AttributeError / pool-identity mismatch on clean origin/main, then
  restored.
- Adjacent: full `tests/hermes_cli/test_model_switch_*.py`,
  `tests/run_agent/test_switch_model_*.py`, and gateway session/model
  tests pass (101 + 18).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 27, 2026 20:13

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a gateway /model provider-switch bug where the target provider’s credential_pool was not propagated, causing 429 recovery to rotate/behave against the previous provider’s credentials.

Changes:

  • Extend ModelSwitchResult to include credential_pool and populate it from resolve_runtime_provider() in switch_model().
  • Store credential_pool in gateway session model overrides for both /model switch paths.
  • Update _apply_session_model_override to propagate credential_pool, including explicitly overriding with None when provided, plus add regression tests for these cases.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File Description
hermes_cli/model_switch.py Adds credential_pool to the model switch result and threads it through runtime resolution paths.
gateway/run.py Persists credential_pool in session overrides and applies it to subsequent runtime kwargs (with explicit-None semantics).
tests/hermes_cli/test_model_switch_credential_pool.py New regression tests asserting credential_pool round-trips through switch_model() across key paths.
tests/gateway/test_session_model_override_routing.py Adds regression tests ensuring session override application propagates/clears/preserves credential_pool correctly.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread hermes_cli/model_switch.py Outdated
Comment on lines +244 to +250
# CredentialPool for the resolved target provider, when one was loaded.
# Callers that maintain runtime kwargs (e.g. the gateway session-override
# store) must propagate this so per-provider rotation continues to work
# after a /model switch — without it, a 429 on the new provider rotates
# within the original provider's pool (or skips rotation entirely and
# falls through to the configured fallback model). See #16678.
credential_pool: Optional[Any] = None

Copilot AI Apr 27, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ModelSwitchResult.credential_pool is typed as Optional[Any], but the runtime provider resolver returns a concrete CredentialPool (see hermes_cli/runtime_provider.py importing CredentialPool). Using Any loses type-safety and makes it harder for callers to know what’s expected. Consider typing this as Optional[CredentialPool] (or Optional["CredentialPool"] with a TYPE_CHECKING import) to document the contract and catch misuse.

Copilot uses AI. Check for mistakes.
Comment thread gateway/run.py
Comment on lines +5994 to 5998
# The new provider's credential pool — without
# this, future-turn rotation on 429 stays bound
# to the original provider's pool. See #16678.
"credential_pool": result.credential_pool,
}

Copilot AI Apr 27, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These overrides now store credential_pool, which is not a str. However, _session_model_overrides is annotated and documented as Dict[str, Dict[str, str]] (and described as holding only model/provider/api_key/base_url/api_mode). Please update the type annotation (and the comment describing the override shape) to allow credential_pool (e.g., Dict[str, Dict[str, Any]] or a TypedDict with credential_pool: CredentialPool | None). This avoids misleading types and helps static analysis.

Copilot uses AI. Check for mistakes.
Address Copilot review on PR NousResearch#16701:

1. **`ModelSwitchResult.credential_pool` typed as `Optional[Any]`** —
   the runtime provider resolver returns a concrete `CredentialPool`
   instance, so the dataclass field is now typed as
   `Optional["CredentialPool"]` via a `TYPE_CHECKING`-guarded import
   from `agent.credential_pool`.  Keeps the runtime import graph
   asymmetric (agent.* depends on hermes_cli.*, not the reverse) while
   surfacing the contract to static analysis.

2. **`_session_model_overrides: Dict[str, Dict[str, str]]`** —
   the dict now also stores a live `CredentialPool` under the
   `credential_pool` key, which is not a `str`.  Widen the value type
   to `Dict[str, Any]` (both the class-level annotation and the
   instance-level assignment in `__init__`) and update the comment to
   document that `credential_pool` rides alongside the str fields.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@briandevans

Copy link
Copy Markdown
Contributor Author

@copilot Both findings addressed in commit e7eb48e06:

Finding 1 (hermes_cli/model_switch.py:250Optional[Any] loses type-safety):
Changed credential_pool: Optional[Any]credential_pool: Optional["CredentialPool"] and added a TYPE_CHECKING-guarded from agent.credential_pool import CredentialPool import. The string-literal annotation keeps the runtime import graph asymmetric (agent.* already depends on hermes_cli.*; reversing that at runtime would risk circular imports) while still letting type checkers resolve the contract.

Finding 2 (gateway/run.py:5998Dict[str, Dict[str, str]] annotation no longer matches contents):
Widened both the class-level (gateway/run.py:640) and instance-level (gateway/run.py:716) annotations from Dict[str, Dict[str, str]]Dict[str, Dict[str, Any]], and updated the comments to document that credential_pool rides alongside the str-only model/provider/api_key/base_url/api_mode fields.

Verified locally: 20/20 tests in tests/gateway/test_session_model_override_routing.py, tests/gateway/test_model_switch_persistence.py, tests/gateway/test_session_model_reset.py, and tests/hermes_cli/test_model_switch_credential_pool.py pass; 86 broader hermes_cli model_switch + credential_pool tests also pass.

@alt-glitch alt-glitch added type/bug Something isn't working P2 Medium — degraded but workaround exists comp/gateway Gateway runner, session dispatch, delivery comp/cli CLI entry point, hermes_cli/, setup wizard area/auth Authentication, OAuth, credential pools labels Apr 27, 2026
@Cyrene963

Copy link
Copy Markdown

Nice work on this fix! Your approach of adding credential_pool to ModelSwitchResult is the right way to solve it — fixing at the source rather than patching every consumer.

I independently hit the same bug and submitted PR #19064 before realizing this PR existed. Our fix takes a different approach (resolving the pool from the provider name at each consumer point), but yours is architecturally cleaner.

One gap I noticed: this PR only fixes the gateway path (gateway/run.py). The CLI has the same bug in two places:

  • cli.py:5476 (_apply_model_switch_result) — called when the user picks a model from the interactive picker
  • cli.py:5700 (_handle_model_switch) — the main /model command handler in CLI mode

Both call agent.switch_model() without passing the pool, so a CLI /model switch to a custom:<name> provider also loses rotation.

Would you be willing to add the CLI fix as well? The pattern is the same — resolve the pool from result.target_provider and pass it through. Alternatively, I can rebase #19064 after yours merges to cover just the CLI + run_agent.py parts.

Happy to close #19064 if you'd prefer to own the full fix. Either way, this PR should go first — it's the proper foundation.

Cyrene963 pushed a commit to Cyrene963/hermes-agent that referenced this pull request May 3, 2026
When a gateway or CLI session uses /model to switch providers, the new
provider's credential_pool was silently dropped — ModelSwitchResult had
no such field, so switch_model() discarded what resolve_runtime_provider
returned.  The agent was created with credential_pool=None, making
_recover_with_credential_pool() a no-op on 429/402 errors.

Fix (combining the approach from NousResearch#16701 with CLI coverage):

- hermes_cli/model_switch.py: Add credential_pool field to
  ModelSwitchResult, capture it from resolve_runtime_provider() in
  both the explicit-provider and same-provider re-resolve paths.

- gateway/run.py: Propagate credential_pool from the result into
  session overrides and in-place agent.switch_model().  Update
  _session_model_overrides type hint from Dict[str, str] to
  Dict[str, Any] to accommodate the CredentialPool instance.

- cli.py: Both _apply_model_switch_result and _handle_model_switch
  pass result.credential_pool to agent.switch_model() and update
  self._credential_pool.

- run_agent.py: Accept credential_pool parameter in switch_model()
  and update self._credential_pool when provided.

Based on the analysis in NousResearch#16701 (briandevans).  Closes NousResearch#16678.
@Cyrene963

Copy link
Copy Markdown

Updated #19064 to adopt your approach from this PR.

What I took from yours:

  • Added credential_pool to ModelSwitchResult — the proper source-level fix
  • Captured pool from resolve_runtime_provider() on both explicit-provider and same-provider re-resolve paths
  • Updated _session_model_overrides type hint to Dict[str, Any]

What I added on top:

  • CLI path coverage: Your PR only fixed the gateway. CLI's _apply_model_switch_result and _handle_model_switch had the same bug — both call agent.switch_model() without passing the pool. Now they use result.credential_pool.
  • run_agent.py: Added credential_pool parameter to AIAgent.switch_model() so the agent's _credential_pool is updated on in-place swaps.

The result is now 4 files, +23/-48 net lines — cleaner than both our original PRs since all pool resolution happens once in model_switch.py rather than at every consumer. Happy to rebase or merge first.

…gent.switch_model

Widens the credential_pool fix to cover the two CLI /model paths and the
AIAgent in-place swap, completing the propagation chain that began with the
gateway-only fix in d768cf0.

Three sites still dropped the resolved pool when /model was invoked outside
the gateway:

  * `cli.py::_apply_model_switch_result` — model picker → switch_model
  * `cli.py::_handle_model_switch` — `/model <name>` command → switch_model
  * `run_agent.py::AIAgent.switch_model` — accepted no `credential_pool`
    kwarg, so even when callers had the pool there was no way to apply it
    to the live agent's `_credential_pool`

After this change the full chain is covered: ModelSwitchResult carries the
pool out of the resolver, both CLI call sites mirror it onto
``self._credential_pool`` and forward it into ``agent.switch_model()``,
which applies it under the same ``if credential_pool is not None`` guard
the gateway uses.  A switch from one pooled provider to another now keeps
429/402 rotation working on the new provider; a switch to a single-key
provider preserves the existing pool rather than silently clearing it.

Test plan
- new `tests/run_agent/test_switch_model_credential_pool.py` (3 cases):
  swap pool, omit kwarg keeps existing, explicit None preserves existing
- existing `tests/hermes_cli/test_model_switch_credential_pool.py` (3) and
  `tests/gateway/test_session_model_override_routing.py` (5) pass against
  the wider propagation
- regression guard: removing the `if credential_pool is not None` block in
  run_agent.switch_model fails the swap-pool test; dropping the kwarg from
  the agent signature fails the call-site at TypeError

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@briandevans

Copy link
Copy Markdown
Contributor Author

@Cyrene963 thanks for the close read and the gap analysis — you were right that the gateway-only fix was incomplete. Pushed 65be58086 to widen this PR to cover both CLI paths and AIAgent.switch_model:

  • cli.py::_apply_model_switch_result (interactive picker) — passes result.credential_pool to agent.switch_model() and mirrors it onto self._credential_pool
  • cli.py::_handle_model_switch (/model <name> command) — same pattern
  • run_agent.py::AIAgent.switch_model — accepts credential_pool=None kwarg under the same if credential_pool is not None guard the gateway uses, so a switch to a single-key provider preserves the existing pool rather than silently clearing it

New regression file tests/run_agent/test_switch_model_credential_pool.py (3 cases) pins the agent-side contract: swap pool / omit kwarg keeps existing / explicit None preserves existing. Combined with the existing tests/hermes_cli/test_model_switch_credential_pool.py (3 cases on the resolver side) and tests/gateway/test_session_model_override_routing.py (5 cases on the override side), the full chain is now contract-tested end-to-end.

Happy for you to close #19064 since this PR now covers the same surface — much appreciated for the heads-up either way.

Cyrene963 pushed a commit to Cyrene963/hermes-agent that referenced this pull request May 3, 2026
When a gateway or CLI session uses /model to switch providers, the new
provider's credential_pool was silently dropped — ModelSwitchResult had
no such field, so switch_model() discarded what resolve_runtime_provider
returned.  The agent was created with credential_pool=None, making
_recover_with_credential_pool() a no-op on 429/402 errors.

Fix (combining the approach from NousResearch#16701 with CLI coverage):

- hermes_cli/model_switch.py: Add credential_pool field to
  ModelSwitchResult, capture it from resolve_runtime_provider() in
  both the explicit-provider and same-provider re-resolve paths.

- gateway/run.py: Propagate credential_pool from the result into
  session overrides and in-place agent.switch_model().  Update
  _session_model_overrides type hint from Dict[str, str] to
  Dict[str, Any] to accommodate the CredentialPool instance.

- cli.py: Both _apply_model_switch_result and _handle_model_switch
  pass result.credential_pool to agent.switch_model() and update
  self._credential_pool.

- run_agent.py: Accept credential_pool parameter in switch_model()
  and update self._credential_pool when provided.

Based on the analysis in NousResearch#16701 (briandevans).  Closes NousResearch#16678.
@briandevans

Copy link
Copy Markdown
Contributor Author

Closing to keep the queue clean — 15 days idle and now conflicting across cli.py / run_agent.py / gateway/run.py. Happy to reopen if this is still useful.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/auth Authentication, OAuth, credential pools comp/cli CLI entry point, hermes_cli/, setup wizard comp/gateway Gateway runner, session dispatch, delivery P2 Medium — degraded but workaround exists type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] Credential pool not loaded when /model switches provider in gateway — causes fallback instead of rotation on 429

4 participants