Bug Report: API key provider fallback chain collapses when env var unavailable but credential_pool has valid key
Bug Description
When the primary model hits rate limits and Hermes falls back to a provider configured via fallback_providers, the fallback can fail with HTTP 401 even when the provider's API key exists in auth.json's credential_pool. This causes the entire fallback chain to collapse, leaving the agent with no working provider.
Root Cause
_resolve_api_key_provider_secret() in hermes_cli/auth.py (L445-475) only reads API keys from os.environ. It does not fall back to credential_pool (auth.json) when the environment variable is absent.
However, credential_pool is a legitimate runtime key source: runtime_provider.py already uses it via the load_pool() → pool.select() path (L795-799). The credential_pool is seeded from env vars by _seed_from_env() in agent/credential_pool.py, so it typically contains the same keys. But when .env hasn't been loaded into os.environ (e.g., after adding a key post-session-start, or through entry points that don't fully load .env), the direct resolution path returns an empty string and the fallback fails.
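The "key added after session start" case can be illustrated with a small standalone sketch. It does not use Hermes code: parse_dotenv is a hypothetical stand-in for the real .env loader, the process_env dict stands in for os.environ, and ZAI_API_KEY is an assumed variable name.

```python
import tempfile
from pathlib import Path


def parse_dotenv(path: Path) -> dict[str, str]:
    """Tiny KEY=VALUE parser; a hypothetical stand-in for the real .env loader."""
    result = {}
    for line in path.read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            name, _, value = line.partition("=")
            result[name.strip()] = value.strip()
    return result


with tempfile.TemporaryDirectory() as tmp:
    env_file = Path(tmp) / ".env"
    env_file.write_text("ZAI_API_KEY=zai-key\n")

    # Session start: the file is read once into a snapshot (standing in for os.environ).
    process_env = parse_dotenv(env_file)

    # Later: the user adds a key to the file while the session keeps running.
    with env_file.open("a") as fh:
        fh.write("DEEPSEEK_API_KEY=sk-added-later\n")

    # What a fresh read of the file would see.
    on_disk = parse_dotenv(env_file)

# The running process still sees no DeepSeek key, although the file has one.
seen_by_process = process_env.get("DEEPSEEK_API_KEY", "")
print(repr(seen_by_process), "DEEPSEEK_API_KEY" in on_disk)
```

Any resolver that consults only the startup snapshot behaves like process_env here, which is why a second key source matters for long-running sessions.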
Impact
Severity: High — affects any user relying on fallback_providers when the primary model is rate-limited.
Failure chain:
- Primary model (e.g., zai/GLM) hits rate limit → triggers fallback
- Fallback to provider X (e.g., deepseek) → _resolve_api_key_provider_secret() returns "" → HTTP 401
- Fallback to provider Y (e.g., minimax) → may also fail (e.g., incompatible message format) → HTTP 400
- All providers exhausted → agent stops
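The failure chain above can be modelled roughly as follows. This is a sketch, not the Hermes fallback loop: run_chain and Attempt are hypothetical names, and the 429 status for the rate limit is an assumption (the 401 and 400 come from the chain above).

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    provider: str
    status: int


def run_chain(providers, call):
    """Try providers in order; return (winner, attempts). winner is None if all fail."""
    attempts = []
    for name in providers:
        status = call(name)
        attempts.append(Attempt(name, status))
        if status == 200:
            return name, attempts
    return None, attempts


# 401 and 400 are taken from the report; 429 for the rate limit is an assumption.
STATUSES = {"zai": 429, "deepseek": 401, "minimax": 400}

winner, attempts = run_chain(["zai", "deepseek", "minimax"], STATUSES.__getitem__)
print(winner, [(a.provider, a.status) for a in attempts])
```

In this model, a single recoverable failure (the 401 caused by the empty key) is enough to push the chain onto a provider that fails for an unrelated reason, so the whole chain ends with no winner.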
Steps to Reproduce
- Configure multiple providers with API keys via hermes model (keys stored in credential_pool)
- Set up fallback_providers in config.yaml:
    fallback_providers:
      - model: deepseek-v4-pro
        provider: deepseek
- Ensure DEEPSEEK_API_KEY is in ~/.hermes/.env but NOT in the running process's os.environ (e.g., start a session, then add the key to .env without restarting)
- Trigger primary model rate limit → fallback fails with 401
Simpler reproduction: Use the ACP adapter entry point (acp_adapter/entry.py L89), which calls load_hermes_dotenv(hermes_home=hermes_home) without the project_env parameter; it is the only entry point that omits it. If provider keys are in the project .env rather than ~/.hermes/.env, they won't be loaded.
Source Code Evidence
_resolve_api_key_provider_secret() (auth.py L445-475) — the only two sources checked:
# Path 1: os.getenv for registered env vars (L462-464)
for env_var in pconfig.api_key_env_vars:
    val = os.getenv(env_var, "").strip()
    if has_usable_secret(val):
        return val, env_var
# Path 2: custom provider key_env (L467-470)
key_env = pconfig.extra.get("key_env", "") if pconfig.extra else ""
if key_env:
    val = os.getenv(key_env, "").strip()
    if has_usable_secret(val):
        return val, key_env
return "", ""  # ← No credential_pool fallback
Contrast with pool path (runtime_provider.py L795-799):
pool = load_pool(provider) if should_use_pool else None
# ...
entry = pool.select() # Reads from credential_pool (auth.json)
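Related Issues
- _build_cheap_route() missing credential_pool key in runtime dict (same design gap)
- fallback_model with provider: custom ignores api_key from config (same symptom family)
- base_url resolution gap (same area)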
Suggested Fix
Add credential_pool as a fallback source in _resolve_api_key_provider_secret(), after env var checks:
def _resolve_api_key_provider_secret(
    provider_id: str, pconfig: ProviderConfig
) -> tuple[str, str]:
    # ... existing env var checks ...
    # Fallback: try credential_pool (auth.json) when env var is missing.
    try:
        pool_entries = read_credential_pool(provider_id)
        for entry in pool_entries:
            token = entry.get("access_token", "")
            if has_usable_secret(token):
                return token, f"credential_pool:{entry.get('id', 'unknown')}"
    except Exception:
        pass  # credential_pool unavailable; don't break auth
    return "", ""
This mirrors the pool path's logic while preserving env var priority. The credential_pool already contains keys seeded from the same env vars, so stale key risk is minimal.
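To check the intended priority order in isolation, the suggested logic can be exercised with stubs. Everything below is a hypothetical stand-in so the sketch runs on its own: this ProviderConfig, this has_usable_secret, the read_credential_pool signature, and the explicit environ and pool arguments are assumptions, not the real Hermes definitions.

```python
from dataclasses import dataclass, field


# Hypothetical stubs; the real ProviderConfig, has_usable_secret and
# pool reader live in Hermes and may look different.
@dataclass
class ProviderConfig:
    api_key_env_vars: list[str] = field(default_factory=list)


def has_usable_secret(value: str) -> bool:
    return bool(value and value.strip())


def read_credential_pool(provider_id: str, pool: dict) -> list[dict]:
    return pool.get(provider_id, [])


def resolve_secret(provider_id, pconfig, environ, pool) -> tuple[str, str]:
    # Environment variables keep priority.
    for env_var in pconfig.api_key_env_vars:
        val = environ.get(env_var, "").strip()
        if has_usable_secret(val):
            return val, env_var
    # Fall back to pool entries only when no env var yielded a key.
    for entry in read_credential_pool(provider_id, pool):
        token = entry.get("access_token", "")
        if has_usable_secret(token):
            return token, f"credential_pool:{entry.get('id', 'unknown')}"
    return "", ""


cfg = ProviderConfig(api_key_env_vars=["DEEPSEEK_API_KEY"])
pool = {"deepseek": [{"id": "p1", "access_token": "sk-pool"}]}

# Env var absent: the pool entry is used.
from_pool = resolve_secret("deepseek", cfg, {}, pool)
# Env var present: it wins over the pool entry.
from_env = resolve_secret("deepseek", cfg, {"DEEPSEEK_API_KEY": "sk-env"}, pool)
# Neither source has a key: the empty result is preserved.
missing = resolve_secret("minimax", cfg, {}, pool)
print(from_pool, from_env, missing)
```

Passing environ and pool explicitly is a choice made only for this sketch, so the three cases can be checked without touching the real process environment or auth.json.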
Secondary Finding: acp_adapter/entry.py missing project_env
acp_adapter/entry.py L89 is the only entry point that doesn't pass project_env to load_hermes_dotenv():
# acp_adapter/entry.py L89 — MISSING project_env
loaded = load_hermes_dotenv(hermes_home=hermes_home)
# All other entry points include project_env:
# main.py L167: load_hermes_dotenv(project_env=PROJECT_ROOT / ".env")
# cli.py L84: load_hermes_dotenv(hermes_home=..., project_env=...)
# gateway/run.py: load_hermes_dotenv(hermes_home=..., project_env=...)
This means ACP users (VS Code / Zed / JetBrains integration) may not have all env vars loaded if they rely on project-level .env.
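The effect of omitting project_env can be sketched with a simplified model of the loader. load_dotenv_sketch and read_env_file are hypothetical and do not reproduce load_hermes_dotenv(); in particular, the precedence between the two files is an assumption, and ZAI_API_KEY is an assumed variable name.

```python
import tempfile
from pathlib import Path
from typing import Optional


def read_env_file(path: Path) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blanks and comments."""
    pairs = {}
    if path.is_file():
        for line in path.read_text().splitlines():
            name, sep, value = line.strip().partition("=")
            if sep and name and not name.startswith("#"):
                pairs[name.strip()] = value.strip()
    return pairs


def load_dotenv_sketch(hermes_home: Path, project_env: Optional[Path] = None) -> dict[str, str]:
    """Hypothetical model of the loader: home .env always, project .env only if passed."""
    loaded = read_env_file(hermes_home / ".env")
    if project_env is not None:
        for name, value in read_env_file(project_env).items():
            loaded.setdefault(name, value)  # precedence here is an assumption
    return loaded


with tempfile.TemporaryDirectory() as tmp:
    home = Path(tmp) / "home"
    project = Path(tmp) / "project"
    home.mkdir()
    project.mkdir()
    (home / ".env").write_text("ZAI_API_KEY=zai-key\n")
    (project / ".env").write_text("DEEPSEEK_API_KEY=sk-project\n")

    # ACP-style call: project_env omitted, so the project key is never seen.
    acp_style = load_dotenv_sketch(hermes_home=home)
    # Call shape used by the other entry points: both files are read.
    cli_style = load_dotenv_sketch(hermes_home=home, project_env=project / ".env")

print(sorted(acp_style), sorted(cli_style))
```

Under this model, the ACP-style call never sees DEEPSEEK_API_KEY, which matches the symptom described above for users who keep provider keys in a project-level .env.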
Environment
- Hermes Agent: 0.9.0 (installed from git, commit 283c8fd)
- OS: Ubuntu 22.04 on WSL2 (Windows 11)
- Python: 3.12
- Providers: zai (GLM-5.1 primary), deepseek (fallback), minimax-cn (fallback)