[Bug] 401 authentication errors do not trigger fallback provider in auxiliary_client.py #21165

@longmei-xsl

Bug Description

Symptom: When the primary inference provider returns a 401 (bad/expired API key), compression fails without surfacing an error to the user, and messages are discarded. The conversation continues, but the log shows:

Compression summary failed: 401

Impact: Message history is lost after compaction events when the primary provider has an auth failure. The fallback chain is never attempted.

Root Cause

In agent/auxiliary_client.py, the should_fallback trigger condition in both call_llm() (sync) and async_call_llm() only checks for three error types:

  • _is_payment_error (402 / 429 credits exceeded)
  • _is_connection_error (network failures)
  • _is_rate_limit_error (429 rate limiting)

401 authentication errors are not included. The _is_auth_error() helper function already exists and is used in the refresh logic (it correctly triggers the auth refresh flow before reaching fallback), but it is missing from the should_fallback boolean that gates whether to try an alternative provider.

Reproduction Path

  1. Configure auto as the provider (uses fallback chain)
  2. Primary provider (e.g. MiniMax) has a bad/expired key → returns 401
  3. Auth refresh is attempted → fails (key really is invalid)
  4. Code reaches should_fallback check → 401 is not in the condition → fallback is skipped
  5. Compression fails, messages are discarded
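
The decision in step 4 can be illustrated with a minimal, self-contained sketch. The predicate bodies and the FakeAPIError class below are hypothetical stand-ins (the real helpers in agent/auxiliary_client.py may inspect errors differently); only the shape of the should_fallback condition is taken from this report.

```python
class FakeAPIError(Exception):
    """Hypothetical error type carrying an HTTP status code."""

    def __init__(self, status_code):
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


def _status(err):
    return getattr(err, "status_code", None)


# Simplified stand-ins for the real predicates (assumed behavior).
def _is_auth_error(err):
    return _status(err) == 401


def _is_payment_error(err):
    return _status(err) == 402


def _is_rate_limit_error(err):
    return _status(err) == 429


def _is_connection_error(err):
    return isinstance(err, ConnectionError)


def should_fallback_before_fix(first_err):
    # Condition as described under Root Cause: no auth check.
    return (
        _is_payment_error(first_err)
        or _is_connection_error(first_err)
        or _is_rate_limit_error(first_err)
    )


def should_fallback_after_fix(first_err):
    # Same condition with the proposed auth check added.
    return _is_auth_error(first_err) or should_fallback_before_fix(first_err)


err = FakeAPIError(401)
print(should_fallback_before_fix(err))  # False: fallback is skipped
print(should_fallback_after_fix(err))   # True: fallback is attempted
```
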

Fix

Add _is_auth_error(first_err) to the should_fallback condition in both sync and async versions.

Sync version (call_llm, ~line 3728):

should_fallback = (
    _is_auth_error(first_err)        # ← add this
    or _is_payment_error(first_err)
    or _is_connection_error(first_err)
    or _is_rate_limit_error(first_err)
)
if _is_auth_error(first_err):
    reason = "auth error"
elif _is_payment_error(first_err):
    reason = "payment error"
elif _is_rate_limit_error(first_err):
    reason = "rate limit"
else:
    reason = "connection error"

Async version (async_call_llm, ~line 4016): identical patch.

Execution order is correct: The auth refresh (Nous-specific or provider credential refresh) runs first (~lines 3649–3700). Fallback is only attempted after refresh fails. Adding _is_auth_error to should_fallback preserves this ordering — refresh is always tried first, and if it fails, fallback kicks in.
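
The ordering described above can be sketched in isolation. This is a simplified model, not the code in auxiliary_client.py: AuthError, call_with_refresh_then_fallback, and the three callables are hypothetical names, and only the auth-error path is modeled.

```python
class AuthError(Exception):
    """Hypothetical stand-in for a provider's 401 response."""


def call_with_refresh_then_fallback(primary, refresh, fallback):
    """Try the primary provider; on an auth error refresh first, then fall back."""
    try:
        return primary()
    except AuthError:
        # Refresh is always attempted before fallback.
        if refresh():
            try:
                return primary()
            except AuthError:
                pass  # refreshed credentials were rejected as well
        # Reached only after refresh failed or the retry failed.
        return fallback()


events = []


def primary():
    events.append("primary")
    raise AuthError("401")


def refresh():
    events.append("refresh")
    return False  # the key really is invalid, so refresh cannot help


def fallback():
    events.append("fallback")
    return "summary from fallback provider"


result = call_with_refresh_then_fallback(primary, refresh, fallback)
print(events)  # ['primary', 'refresh', 'fallback']
print(result)  # summary from fallback provider
```
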

Verification

After applying the fix, trigger a compression event (long conversation) and check logs for:

trying fallback
auth error

Then confirm the patch is present in both code paths:

grep -c "_is_auth_error(first_err)" agent/auxiliary_client.py
# Expected: >= 2 (sync + async versions)
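
Because the refresh logic already calls _is_auth_error, a raw grep count cannot show where each match sits. A stricter check, sketched here under the assumption that both functions assign a variable literally named should_fallback, is to parse the file and inspect that assignment directly. The function fallback_checks_auth and the sample source are illustrative only.

```python
import ast


def fallback_checks_auth(source, helper="_is_auth_error", target="should_fallback"):
    """Map each function that assigns `target` to whether that assignment calls `helper`.

    If a function assigns `target` more than once, the last assignment wins.
    """
    result = {}
    for func in ast.walk(ast.parse(source)):
        if not isinstance(func, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        for node in ast.walk(func):
            if not isinstance(node, ast.Assign):
                continue
            if not any(isinstance(t, ast.Name) and t.id == target for t in node.targets):
                continue
            # Collect the names of plain function calls inside the assigned expression.
            calls = {
                c.func.id
                for c in ast.walk(node.value)
                if isinstance(c, ast.Call) and isinstance(c.func, ast.Name)
            }
            result[func.name] = helper in calls
    return result


# Illustrative source: the sync version is patched, the async version is not.
sample = '''
def call_llm():
    should_fallback = (
        _is_auth_error(first_err)
        or _is_payment_error(first_err)
    )

async def async_call_llm():
    should_fallback = _is_payment_error(first_err)
'''

report = fallback_checks_auth(sample)
print(report)  # {'call_llm': True, 'async_call_llm': False}
```

To run it against the real module, pass the contents of agent/auxiliary_client.py as source; after the fix, both call_llm and async_call_llm should map to True.
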

Additional Context

  • _is_auth_error() function exists at line 1774 and correctly identifies 401 errors
  • The function is already used in the refresh logic path (correctly)
  • The gap is only in the fallback trigger condition
  • Fix is a minimal 2-line addition to should_fallback + corresponding reason branch
  • No changes to refresh logic required

Metadata

Labels

  • P1: High; major feature broken, no workaround
  • area/auth: Authentication, OAuth, credential pools
  • comp/agent: Core agent loop, run_agent.py, prompt builder
  • type/bug: Something isn't working
