
fix(auth): distinguish revoked API keys from transient auth errors#25754

Merged: gumadeiras merged 3 commits into openclaw:main from rrenamed:fix/auth-permanent-failover on Feb 26, 2026


rrenamed (Contributor) commented Feb 24, 2026

Summary

  • Problem: When an API key is revoked, resolveFailoverReasonFromError() maps HTTP 401/403 to "auth" with a transient cooldown (max 1h). The key retries forever instead of being permanently disabled.
  • Why it matters: During key rotation, a revoked key causes agents to retry indefinitely with exponential backoff capped at 1 hour, wasting tokens and blocking failover.
  • What changed: Added "auth_permanent" as a new FailoverReason variant. Provider-specific permanent auth error signals (e.g. invalid_api_key, key has been revoked) are detected in the error message and routed through the disabledUntil path (5h default, 24h max — same as billing) instead of the transient cooldownUntil path.
  • What did NOT change (scope boundary): Transient auth errors ("unauthorized", "forbidden", "invalid token", expired tokens) still use the existing "auth" reason with short cooldown. No changes to billing error handling. No changes to UI/display code.
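A minimal sketch of the classification described above, for reviewers who want the intent in one place. This is not the PR's actual code: `classifyAuthFailure`, `PERMANENT_PATTERNS`, and the exact regex shapes are hypothetical. Only the signal strings (`invalid_api_key`, `key has been revoked`), the 401/403 gate, and the "permanent before transient" ordering come from this description.

```typescript
// Hypothetical sketch; names and regex shapes are assumptions, not the merged code.
type AuthFailoverReason = "auth" | "auth_permanent";

// Signals treated as permanent. The two literal strings are named in the PR;
// the third regex is an assumed shape for "api key ... revoked|deactivated|deleted".
const PERMANENT_PATTERNS: RegExp[] = [
  /invalid_api_key/i,
  /key has been revoked/i,
  /api[ _-]?key\b.*\b(revoked|deactivated|deleted)/i,
];

function classifyAuthFailure(
  status: number,
  message: string,
): AuthFailoverReason | null {
  // Only auth-style HTTP statuses are handled in this sketch.
  if (status !== 401 && status !== 403) return null;
  // Permanent patterns are checked first so that overlapping messages
  // do not fall through to the transient "auth" reason.
  if (PERMANENT_PATTERNS.some((p) => p.test(message))) return "auth_permanent";
  // Bare 401/403, empty messages, and transient keywords stay "auth".
  return "auth";
}
```

Under these assumptions, `"invalid_api_key"` maps to `"auth_permanent"`, while `"invalid api key"` (with spaces) and an empty message still map to `"auth"`.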

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

User-visible / Behavior Changes

  • Revoked/deleted API keys are now disabled for 5h (exponential backoff to 24h max) instead of retrying every ≤1h forever.
  • models status will show disabledUntil + disabledReason: "auth_permanent" for affected profiles.
  • Profile recovers automatically after the disable window expires, or immediately on successful use (e.g. after key replacement).
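The behavior changes above can be sketched as a small state model. Everything here is illustrative: `ProfileUsageStats`, `markFailure`, `markUsed`, and `isAvailable` are hypothetical stand-ins rather than the real helpers, and the transient cooldown is simplified to a flat placeholder value (the source describes exponential backoff capped at 1h).

```typescript
// Hypothetical sketch of the two paths; not the actual auth-profiles code.
type FailureReason = "auth" | "auth_permanent" | "billing";

interface ProfileUsageStats {
  cooldownUntil?: number; // transient path (epoch ms)
  disabledUntil?: number; // long disable path (epoch ms)
  disabledReason?: FailureReason;
  lastUsed?: number;
}

const HOUR_MS = 60 * 60 * 1000;
const DEFAULT_DISABLE_MS = 5 * HOUR_MS; // 5h default, per the PR description
const TRANSIENT_COOLDOWN_MS = 60 * 1000; // placeholder value, an assumption

function markFailure(
  stats: ProfileUsageStats,
  reason: FailureReason,
  now: number,
): ProfileUsageStats {
  if (reason === "auth_permanent" || reason === "billing") {
    // Long disable path: sets disabledUntil + disabledReason, leaves cooldownUntil alone.
    return { ...stats, disabledUntil: now + DEFAULT_DISABLE_MS, disabledReason: reason };
  }
  // Transient path: short cooldown only.
  return { ...stats, cooldownUntil: now + TRANSIENT_COOLDOWN_MS };
}

// A successful use clears both cooldown and disabled state.
function markUsed(_stats: ProfileUsageStats, now: number): ProfileUsageStats {
  return { lastUsed: now };
}

// A profile is usable once both windows (if any) have expired.
function isAvailable(stats: ProfileUsageStats, now: number): boolean {
  return now >= (stats.cooldownUntil ?? 0) && now >= (stats.disabledUntil ?? 0);
}
```

In this model a profile marked `auth_permanent` becomes available again either when `disabledUntil` passes or as soon as `markUsed` runs after a successful request, which mirrors the two recovery routes listed above.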

Security Impact (required)

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No
  • Command/tool execution surface changed? No
  • Data access scope changed? No

Repro + Verification

Environment

  • OS: any
  • Runtime/container: Node 22+
  • Model/provider: any provider returning 401/403 with permanent auth error messages

Steps

  1. Configure a profile with a revoked API key
  2. Trigger an LLM request that fails with 401 + "invalid_api_key" in the error message
  3. Check usageStats for the profile

Expected

  • Profile gets disabledUntil set (5h+ window), disabledReason: "auth_permanent"
  • Profile does NOT get cooldownUntil (transient path)

Actual

  • Before: profile gets cooldownUntil (max 1h), retries forever
  • After: profile gets disabledUntil (5h default, 24h max), matching billing error behavior

Evidence

  • Failing test/log before + passing after
    • New tests in failover-error.test.ts (6 tests), auth-profiles.markauthprofilefailure.test.ts (1 test), pi-embedded-helpers.isbillingerrormessage.test.ts (3 test blocks)
    • 164 tests across 7 agent-area test files pass
    • pnpm check (oxfmt + tsgo + oxlint) clean

Human Verification (required)

  • Verified scenarios: backward compat (bare 401/403 still returns "auth"), permanent auth detection from status+message, pattern separation ("invalid api key" → auth vs "invalid_api_key" → auth_permanent), disabledUntil path for auth_permanent, clearExpiredCooldowns handles auth_permanent generically, markAuthProfileUsed clears disabled state on success
  • Edge cases checked: empty message on 401/403 (falls through to "auth"), FAILURE_REASON_SET/FAILURE_REASON_ORDER auto-updated, as const type inference, pi-embedded-runner/run.ts propagation, list.probe.ts/list.status-command.ts fallthrough behavior
  • What you did not verify: live end-to-end with a real revoked key (would need a deliberately revoked provider key to trigger the 401 + "invalid_api_key" flow in a running instance)

Compatibility / Migration

  • Backward compatible? Yes
  • Config/env changes? No
  • Migration needed? No

Failure Recovery (if this breaks)

  • How to disable/revert this change quickly: revert this single commit; the "auth_permanent" reason will simply never be produced, falling back to "auth" transient cooldown
  • Files/config to restore: none (no config changes)
  • Known bad symptoms reviewers should watch for: profiles being permanently disabled when they shouldn't be (false positive on the authPermanent patterns), or profiles NOT being disabled when they should be (pattern too narrow)

Risks and Mitigations

  • Risk: authPermanent patterns could false-positive on messages that contain "invalid_api_key" as substring in a longer non-permanent error
    • Mitigation: The patterns are specific (exact string "invalid_api_key", regexes requiring revoked|deactivated|deleted after api key). Transient auth keywords ("unauthorized", "forbidden", "invalid token") are NOT in the permanent list. The permanent check runs only when HTTP status is already 401/403 or when message-only classification is used.

Greptile Summary

This PR adds a new "auth_permanent" failover reason to distinguish permanently revoked/invalid API keys from transient auth errors. Revoked keys now trigger the longer disabledUntil path (5h default, 24h max) instead of the shorter cooldownUntil path (max 1h), preventing indefinite retry loops during key rotation.

The implementation follows existing patterns:

  • Pattern detection checks authPermanent before auth in classifyFailoverReason() to handle overlapping patterns correctly
  • auth_permanent reuses the billing backoff logic (disabledUntil + exponential backoff)
  • FAILURE_REASON_PRIORITY places auth_permanent at highest priority (before auth)
  • Test coverage includes pattern matching, backoff behavior, and backward compatibility

Confidence Score: 5/5

  • Safe to merge with high confidence
  • The change is well-contained with comprehensive test coverage (10 new tests), follows existing patterns for failover handling, maintains backward compatibility (bare 401/403 still returns "auth"), and includes proper ordering to handle pattern overlaps. The implementation correctly reuses billing backoff logic for permanent auth failures.
  • No files require special attention

Last reviewed commit: c66546c

openclaw-barnacle bot added the agents (Agent runtime and tooling) and size: S labels Feb 24, 2026
rrenamed (Contributor, Author) commented Feb 24, 2026

Design note: why auth_permanent uses time-bounded disabledUntil (not true permanent disable)

Went through 13 related issues before landing on this approach. The instinct is to permanently kill a revoked key: it will never come back, so why retry? Two reasons we shouldn't do that yet:

Misclassification is a real problem in this codebase. Five separate issues document auth errors being wrongly classified — OAuth token expiry tagged as billing (#18624), OAuth refresh path unreachable so tokens enter permanent cooldown (#17873), hardcoded rate_limit on what's actually an OAuth error (#23996, #13909), server 500/503 tagged as rate_limit (#22294). OAuth Access Tokens (sk-ant-oat01-*) expire every ~8h and return 401 when they do. If our patterns were even slightly too broad, we'd permanently disable every OAuth user after 8 hours. We kept the patterns narrow (invalid_api_key, key has been revoked, etc.) but narrow isn't zero-risk, and permanent disable amplifies every misclassification into a hard lockout.

There's no CLI recovery path. No openclaw provider reset command exists (#21574). The only way to clear disabledUntil today is manually editing auth-profiles.json. Shipping true permanent disable without an escape hatch felt wrong — that's exactly the complaint in #21574 and #13909.

So instead: auth_permanent reuses the billing disabledUntil path (5h default → 24h max). A revoked key retries once a day, fails, gets another 24h cooldown. Effectively permanent, but self-healing if we got the classification wrong. And disabledReason: "auth_permanent" in openclaw models status gives clear visibility into what happened.
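To make the "5h default → 24h max" arithmetic concrete, here is a rough sketch of the disable window as a function of consecutive failures. The doubling factor and the name `disableWindowMs` are assumptions; this description only states the 5h default, the 24h cap, and that the backoff is exponential.

```typescript
// Hypothetical sketch; the growth factor of 2 is an assumption.
const HOUR_IN_MS = 60 * 60 * 1000;
const BASE_DISABLE_MS = 5 * HOUR_IN_MS; // 5h default (from the PR)
const MAX_DISABLE_MS = 24 * HOUR_IN_MS; // 24h max (from the PR)

function disableWindowMs(consecutiveFailures: number): number {
  // Treat anything below 1 as the first failure.
  const n = Math.max(1, Math.floor(consecutiveFailures));
  // Cap the exponent so very large counts cannot overflow.
  const exponent = Math.min(n - 1, 10);
  return Math.min(BASE_DISABLE_MS * Math.pow(2, exponent), MAX_DISABLE_MS);
}

// Window length in hours for the first five consecutive failures.
const scheduleHours = [1, 2, 3, 4, 5].map((n) => disableWindowMs(n) / HOUR_IN_MS);
// scheduleHours is [5, 10, 20, 24, 24] under the doubling assumption
```

With doubling, a truly revoked key reaches the 24h cap on its fourth consecutive failure and then settles into the once-a-day retry described above.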

True permanent disable makes sense as a follow-up once there's a recovery command and the OAuth refresh path (#17873) is fixed.

Related: #18624, #17873, #23996, #13909, #21574, #23815, #16668, #20316, #22294, #14376, #23317

gumadeiras self-assigned this Feb 26, 2026
gumadeiras force-pushed the fix/auth-permanent-failover branch from c66546c to aadd7cf, Feb 26, 2026 00:24
openclaw-barnacle bot added the commands (Command implementations) label Feb 26, 2026
gumadeiras force-pushed the fix/auth-permanent-failover branch from aadd7cf to cb1b9b4, Feb 26, 2026 00:25
gumadeiras force-pushed the fix/auth-permanent-failover branch from ec3fe7b to 8f9c07a, Feb 26, 2026 00:46
gumadeiras merged commit c002627 into openclaw:main Feb 26, 2026
8 checks passed
gumadeiras (Member) commented

Merged via squash.

Thanks @rrenamed!

brianleach pushed a commit to brianleach/openclaw that referenced this pull request Feb 26, 2026
…penclaw#25754)

Merged via /review-pr -> /prepare-pr -> /merge-pr.

Prepared head SHA: 8f9c07a
Co-authored-by: rrenamed <87486610+rrenamed@users.noreply.github.com>
Co-authored-by: gumadeiras <5599352+gumadeiras@users.noreply.github.com>
Reviewed-by: @gumadeiras

Labels

agents (Agent runtime and tooling), commands (Command implementations), size: M

Development

Successfully merging this pull request may close these issues.

Auth failover: distinguish revoked tokens from transient auth errors

2 participants