
[Bug]: Fallback chain collapses when API key unavailable in os.environ but present in credential_pool #15914

@MuBeiGe

Bug Report: API key provider fallback chain collapses when env var unavailable but credential_pool has valid key

Bug Description

When the primary model hits rate limits and Hermes falls back to a provider configured via fallback_providers, the fallback can fail with HTTP 401 even when the provider's API key exists in auth.json's credential_pool. This causes the entire fallback chain to collapse, leaving the agent with no working provider.

Root Cause

_resolve_api_key_provider_secret() in hermes_cli/auth.py (L445-475) only reads API keys from os.environ. It does not fall back to credential_pool (auth.json) when the environment variable is absent.

However, credential_pool is a legitimate runtime key source: runtime_provider.py already uses it via the load_pool() → pool.select() path (L795-799). The credential_pool is seeded from env vars by _seed_from_env() in agent/credential_pool.py, so it typically contains the same keys. But when .env has not been loaded into os.environ (e.g., after adding a key post-session-start, or through entry points that don't fully load .env), the direct resolution path returns an empty string, which breaks fallback.
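To make the mismatch concrete, here is a minimal sketch of the two lookups side by side. It is not Hermes code: resolve_from_env and the in-memory credential_pool dict are stand-ins invented for illustration, and the entry fields (id, access_token) are assumed from the suggested fix further down.

```python
import os


def resolve_from_env(env_vars):
    """Env-only lookup with the same shape as the resolver described above."""
    for name in env_vars:
        val = os.environ.get(name, "").strip()
        if val:
            return val, name
    return "", ""


# Hypothetical stand-in for the credential_pool section of auth.json.
credential_pool = {
    "deepseek": [{"id": "pool-1", "access_token": "sk-demo"}],
}

# Simulate a process where .env was never loaded into os.environ.
os.environ.pop("DEEPSEEK_API_KEY", None)

env_result = resolve_from_env(["DEEPSEEK_API_KEY"])
pool_has_key = any(
    entry.get("access_token") for entry in credential_pool.get("deepseek", [])
)

# The env-only path finds nothing even though the pool holds a key.
print(env_result, pool_has_key)
```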

Impact

Severity: High — affects any user relying on fallback_providers when the primary model is rate-limited.

Failure chain:

  1. Primary model (e.g., zai/GLM) hits rate limit → triggers fallback
  2. Fallback to provider X (e.g., deepseek) → _resolve_api_key_provider_secret() returns "" → HTTP 401
  3. Fallback to provider Y (e.g., minimax) → may also fail (e.g., incompatible message format) → HTTP 400
  4. All providers exhausted → agent stops
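The failure chain above can be simulated with a small sketch. The run_chain helper and the status-code tables are illustrative only (they are not Hermes's fallback loop); the status codes are the ones listed in the steps above.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    provider: str
    status: int


def run_chain(providers, call):
    """Try providers in order; return (winner or None, list of attempts)."""
    attempts = []
    for name in providers:
        status = call(name)
        attempts.append(Attempt(name, status))
        if status == 200:
            return name, attempts
    return None, attempts


chain = ["zai", "deepseek", "minimax"]

# Today: deepseek is called with an empty key and answers 401.
broken = {"zai": 429, "deepseek": 401, "minimax": 400}
winner, attempts = run_chain(chain, lambda name: broken[name])

# With the key resolved from credential_pool, the chain stops at deepseek.
fixed = {"zai": 429, "deepseek": 200, "minimax": 400}
winner_fixed, attempts_fixed = run_chain(chain, lambda name: fixed[name])
```

In the first run every provider is exhausted and no winner is returned; in the second the chain ends at the first fallback, which is the behavior the fix is meant to restore.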

Steps to Reproduce

  1. Configure multiple providers with API keys via hermes model (keys stored in credential_pool)
  2. Set up fallback_providers in config.yaml:
    fallback_providers:
      - model: deepseek-v4-pro
        provider: deepseek
  3. Ensure DEEPSEEK_API_KEY is in ~/.hermes/.env but NOT in the current shell's os.environ (e.g., start a session, then add the key to .env without restarting)
  4. Trigger primary model rate limit → fallback fails with 401

Simpler reproduction: Use the ACP adapter entry point (acp_adapter/entry.py L89), which calls load_hermes_dotenv(hermes_home=hermes_home) without the project_env parameter; it is the only entry point that omits it. If provider keys are in the project .env rather than ~/.hermes/.env, they won't be loaded.
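The stale-environment condition from step 3 can also be shown in isolation. The sketch below uses a tiny hand-written KEY=VALUE loader (load_env_file is a demo helper, not Hermes's loader, and DEMO_PRIMARY_KEY is a made-up variable name) to show that a key appended to .env after startup stays invisible to os.environ until something reloads the file.

```python
import os
import tempfile
from pathlib import Path


def load_env_file(path):
    """Demo-only loader: copy KEY=VALUE lines into os.environ if not already set."""
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip())


os.environ.pop("DEEPSEEK_API_KEY", None)

with tempfile.TemporaryDirectory() as tmp:
    env_path = Path(tmp) / ".env"
    env_path.write_text("DEMO_PRIMARY_KEY=demo-primary\n")
    load_env_file(env_path)  # "session start"

    # The key is added to the file after the process has already read it.
    env_path.write_text(
        "DEMO_PRIMARY_KEY=demo-primary\nDEEPSEEK_API_KEY=demo-deepseek\n"
    )
    seen_before_reload = os.environ.get("DEEPSEEK_API_KEY", "")

    load_env_file(env_path)  # only an explicit reload picks it up
    seen_after_reload = os.environ.get("DEEPSEEK_API_KEY", "")

os.environ.pop("DEEPSEEK_API_KEY", None)  # leave the environment clean
```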

Source Code Evidence

_resolve_api_key_provider_secret() (auth.py L445-475) — the only two sources checked:

# Path 1: os.getenv for registered env vars (L462-464)
for env_var in pconfig.api_key_env_vars:
    val = os.getenv(env_var, "").strip()
    if has_usable_secret(val):
        return val, env_var

# Path 2: custom provider key_env (L467-470)  
key_env = pconfig.extra.get("key_env", "") if pconfig.extra else ""
if key_env:
    val = os.getenv(key_env, "").strip()
    if has_usable_secret(val):
        return val, key_env

return "", ""  # ← No credential_pool fallback

Contrast with pool path (runtime_provider.py L795-799):

pool = load_pool(provider) if should_use_pool else None
# ...
entry = pool.select()  # Reads from credential_pool (auth.json)

Related Issues

Suggested Fix

Add credential_pool as a fallback source in _resolve_api_key_provider_secret(), after env var checks:

def _resolve_api_key_provider_secret(
    provider_id: str, pconfig: ProviderConfig
) -> tuple[str, str]:
    # ... existing env var checks ...

    # Fallback: try credential_pool (auth.json) when env var is missing.
    try:
        pool_entries = read_credential_pool(provider_id)
        for entry in pool_entries:
            token = entry.get("access_token", "")
            if has_usable_secret(token):
                return token, f"credential_pool:{entry.get('id', 'unknown')}"
    except Exception:
        pass  # credential_pool unavailable — don't break auth

    return "", ""

This mirrors the pool path's logic while preserving env var priority. The credential_pool already contains keys seeded from the same env vars, so the risk of returning a stale key is low.
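For anyone who wants to try the idea outside the codebase, here is a self-contained sketch of the same resolution order. Everything in it is an assumption made for illustration: the auth.json layout ({"credential_pool": {provider: [entries]}}), the read_credential_pool signature, and the simplified has_usable_secret are stand-ins, not the real Hermes helpers.

```python
import json
import os
import tempfile
from pathlib import Path


def has_usable_secret(value):
    """Stand-in check: a non-empty string after stripping."""
    return isinstance(value, str) and bool(value.strip())


def read_credential_pool(auth_path, provider_id):
    """Assumed layout: {"credential_pool": {provider_id: [{id, access_token}]}}."""
    try:
        data = json.loads(Path(auth_path).read_text())
    except (OSError, json.JSONDecodeError):
        return []  # pool unavailable: behave as if it were empty
    if not isinstance(data, dict):
        return []
    pool = data.get("credential_pool", {})
    entries = pool.get(provider_id, []) if isinstance(pool, dict) else []
    return [e for e in entries if isinstance(e, dict)] if isinstance(entries, list) else []


def resolve_secret(provider_id, env_vars, auth_path):
    # Environment variables keep priority, as in the existing resolver.
    for name in env_vars:
        val = os.environ.get(name, "").strip()
        if has_usable_secret(val):
            return val, name
    # Fallback: first usable entry in the credential pool.
    for entry in read_credential_pool(auth_path, provider_id):
        token = entry.get("access_token", "")
        if has_usable_secret(token):
            return token.strip(), f"credential_pool:{entry.get('id', 'unknown')}"
    return "", ""


with tempfile.TemporaryDirectory() as tmp:
    auth_path = Path(tmp) / "auth.json"
    auth_path.write_text(json.dumps(
        {"credential_pool": {"deepseek": [{"id": "p1", "access_token": "sk-pool"}]}}
    ))

    os.environ.pop("DEEPSEEK_API_KEY", None)
    from_pool = resolve_secret("deepseek", ["DEEPSEEK_API_KEY"], auth_path)

    os.environ["DEEPSEEK_API_KEY"] = "sk-env"
    from_env = resolve_secret("deepseek", ["DEEPSEEK_API_KEY"], auth_path)
    os.environ.pop("DEEPSEEK_API_KEY", None)
```

One difference from the snippet above: the sketch catches only file and JSON errors rather than every Exception, so an unexpected bug in the pool reader would still surface instead of being silently swallowed. Whether that trade-off suits the real resolver is a maintainer call.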

Secondary Finding: acp_adapter/entry.py missing project_env

acp_adapter/entry.py L89 is the only entry point that doesn't pass project_env to load_hermes_dotenv():

# acp_adapter/entry.py L89 — MISSING project_env
loaded = load_hermes_dotenv(hermes_home=hermes_home)

# All other entry points include project_env:
# main.py L167:    load_hermes_dotenv(project_env=PROJECT_ROOT / ".env")
# cli.py L84:      load_hermes_dotenv(hermes_home=..., project_env=...)
# gateway/run.py:  load_hermes_dotenv(hermes_home=..., project_env=...)

This means ACP users (VS Code / Zed / JetBrains integration) may not have all env vars loaded if they rely on project-level .env.
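The effect of omitting project_env can be sketched with a stand-in loader. collect_env below is hypothetical and only borrows the two parameter names from the calls quoted above; the real load_hermes_dotenv may merge files differently, and the "first file wins" precedence here is an assumption.

```python
import tempfile
from pathlib import Path


def parse_env(path):
    """Parse KEY=VALUE lines from a file; a missing file yields no values."""
    path = Path(path)
    values = {}
    if not path.is_file():
        return values
    for line in path.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def collect_env(hermes_home, project_env=None):
    """Stand-in loader: merge <hermes_home>/.env with an optional project .env."""
    merged = dict(parse_env(Path(hermes_home) / ".env"))
    if project_env is not None:
        for key, value in parse_env(project_env).items():
            merged.setdefault(key, value)  # assumed precedence: home file wins
    return merged


with tempfile.TemporaryDirectory() as tmp:
    home = Path(tmp) / "hermes_home"
    project = Path(tmp) / "project"
    home.mkdir()
    project.mkdir()
    (home / ".env").write_text("DEMO_PRIMARY_KEY=demo-primary\n")
    (project / ".env").write_text("DEEPSEEK_API_KEY=demo-deepseek\n")

    acp_style = collect_env(home)  # mirrors the ACP call
    other_style = collect_env(home, project_env=project / ".env")
```

With the project file omitted, the fallback provider's key never reaches the process, which is the starting condition for the 401 described in the main report.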

Environment

  • Hermes Agent: 0.9.0 (installed from git, commit 283c8fd)
  • OS: Ubuntu 22.04 on WSL2 (Windows 11)
  • Python: 3.12
  • Providers: zai (GLM-5.1 primary), deepseek (fallback), minimax-cn (fallback)

Metadata

Assignees: none

Labels:

  • P1: High, major feature broken, no workaround
  • area/auth: Authentication, OAuth, credential pools
  • comp/agent: Core agent loop, run_agent.py, prompt builder
  • type/bug: Something isn't working
