fix(oneshot): honor fallback_providers chain during worker startup, not just runtime #9

Merged: jarvis-stark-ops merged 1 commit into main from wt/xai-auth-tolerance on Jun 8, 2026

@jarvis-stark-ops

Summary

Why this matters

v6.6 incident (2026-06-07): the xAI OAuth token went missing mid-session and every subsequent worker crashed at startup, despite each profile having fallback_providers: [openai-codex/gpt-5.5, xai-oauth/grok-4.3] configured. The fallback chain was honored only after successful credential resolution (in AIAgent's runtime loop), not during initial worker startup. This PR applies the chain at startup as well.

Combined with #7 (now merged — dispatcher heartbeat) and #5 (still open — 429 exit-code mapping), the worker-startup → dispatcher-detect → next-retry loop is now operationally robust.

Test plan (9/9 passing)

  • Primary succeeds → no fallback attempted, full chain preserved for AIAgent
  • Primary fails, first fallback succeeds → effective_model advances, remaining_chain sliced
  • Two failures → third succeeds, slicing correct
  • All fail → LAST AuthError propagates (not primary's)
  • Empty chain → primary error verbatim
  • Fallback without explicit model → effective_model preserved
  • explicit_pin=True (CLI --model/--provider) → no fallback, primary error verbatim
  • Rate-limit AuthError → no fallback (quota amplification footgun), let rate-limit retry handle
  • Same provider in chain → no infinite loop, advances to next entry
  • Manual: live-test with one Marvel team profile that has the chain configured, kill xAI auth, confirm worker recovers via Codex/Anthropic

Code-review focus

  1. Safety bound: explicit CLI pin. hermes -z --model X --provider Y should NOT silently downgrade. New explicit_pin: bool parameter, derived from CLI args (not env vars, not config).
  2. Safety bound: rate-limit on primary — re-uses existing is_rate_limited_auth_error() from auth.py.
  3. Safety bound: remaining-chain handoff — helper now returns (runtime, model, landed_idx, remaining_chain); caller passes remaining to AIAgent so its runtime loop doesn't re-try the dead primary.
  4. Hoisted get_fallback_chain(cfg) to one local — was being called twice (TOCTOU on mutable cfg).
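
To illustrate item 4, here is a minimal sketch of why a single read matters. The `get_fallback_chain` below is a hypothetical stand-in (the real function's return type is not shown in this PR), and modeling chain entries as plain "provider/model" strings is an assumption.

```python
from typing import Any


def get_fallback_chain(cfg: dict[str, Any]) -> list[str]:
    """Hypothetical stand-in: return a copy of the configured chain."""
    return list(cfg.get("fallback_providers", []))


cfg = {"fallback_providers": ["openai-codex/gpt-5.5", "xai-oauth/grok-4.3"]}

# Read the chain once and reuse the same snapshot for both the startup
# resolution and the later handoff to the runtime loop.
chain = get_fallback_chain(cfg)

# Simulate cfg being mutated between the two uses; the snapshot is unaffected.
cfg["fallback_providers"].pop()

landed_idx = 0  # index of the chain entry that startup resolution landed on
remaining = chain[landed_idx + 1:]
```

With two separate reads, the index computed against the first read could point at a different entry in the second.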

Follow-up (separate issues, not blocking)

  • Consider applying the same pattern to gateway/run.py:_resolve_runtime_agent_kwargs and cli.py:4881-4914 for consistent worker-startup contract across all surfaces (CLI / gateway-spawned / oneshot).
  • Optional: emit metric counter when fallback fires so silent "always-failing primary" is detectable.

🤖 Generated with Claude Code

…ot just runtime

Closes #6.

Problem
Worker startup calls `resolve_runtime_provider` to acquire credentials for
the primary provider. If that raises AuthError (xAI OAuth token expired,
Anthropic logged out, Codex revoked), the worker crashes before AIAgent's
runtime fallback loop ever gets a chance — even though the user has
explicitly configured a fallback chain for exactly this case.

Observed in the v6.6 incident 2026-06-07: xAI OAuth token went missing
mid-session and every subsequent worker crashed at startup despite having
`fallback_providers: [openai-codex/gpt-5.5, xai-oauth/grok-4.3]` configured.

Solution
New helper `_resolve_runtime_with_fallback` wraps the primary-resolution
call. On AuthError, iterates the configured fallback chain (read once from
`get_fallback_chain(cfg)`) until one succeeds. If all fail, re-raises the
LAST AuthError so cli.py's exit handling can surface it.
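
A minimal sketch of the loop described above, not the merged code. The
function name `resolve_with_fallback`, the injected `resolve` callable, the
"provider/model" string entries, and returning `None` as the landed index
when the primary succeeds are all assumptions made for illustration.

```python
from typing import Callable, Optional


class AuthError(Exception):
    """Stand-in for the project's AuthError."""


def resolve_with_fallback(
    resolve: Callable[[str], dict],
    primary_provider: str,
    primary_model: str,
    chain: list[str],
) -> tuple[dict, str, Optional[int], list[str]]:
    """Try the primary, then each chain entry in order.

    Returns (runtime, effective_model, landed_at_index, remaining_chain).
    """
    try:
        # Primary succeeded: the full chain stays available to the caller.
        return resolve(primary_provider), primary_model, None, list(chain)
    except AuthError as exc:
        last_error = exc

    for idx, entry in enumerate(chain):
        provider, _, model = entry.partition("/")
        try:
            runtime = resolve(provider)
        except AuthError as exc:
            last_error = exc  # remember the most recent failure
            continue
        # An entry without a model keeps the originally requested model.
        return runtime, model or primary_model, idx, chain[idx + 1:]

    # Every candidate failed (or the chain was empty): surface the LAST error.
    raise last_error
```

The loop is strictly linear over the chain, so an entry naming the same
provider as the primary fails once and the loop moves on.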

Three safety bounds preserved (informed by code-review):

1. **Explicit CLI pin** — `hermes -z --model X --provider Y ...` should NOT
   silently downgrade. When `model` OR `provider` was a non-empty CLI arg,
   the helper re-raises primary AuthError verbatim, no fallback attempt.

2. **Rate-limit AuthError on primary** — falling through to other providers
   would burn their quota in milliseconds (the "quota amplification"
   footgun). Detected via existing `is_rate_limited_auth_error()` —
   re-raise immediately; existing rate-limit handling (cli.py exit 75)
   gets the task requeued.

3. **Remaining-chain handoff to AIAgent** — when fallback lands on chain
   entry [N], AIAgent's runtime fallback loop should only see entries
   AFTER N (not the dead primary, not the entry we just used). The helper
   now returns `(runtime, effective_model, landed_at_index, remaining_chain)`
   and the caller passes `remaining` to AIAgent's `fallback_model`.
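
Bounds 1 and 2 amount to a guard evaluated before the chain is touched. The
sketch below is illustrative only: `may_fall_back` is a hypothetical name,
and `is_rate_limited_auth_error` here is a stand-in keyed on an assumed
`rate_limited` attribute, since the real check in auth.py is not shown in
this PR.

```python
class AuthError(Exception):
    """Stand-in for the project's AuthError, with an assumed flag."""

    def __init__(self, message: str, rate_limited: bool = False) -> None:
        super().__init__(message)
        self.rate_limited = rate_limited


def is_rate_limited_auth_error(err: Exception) -> bool:
    """Stand-in for the existing helper in auth.py (real logic not shown)."""
    return bool(getattr(err, "rate_limited", False))


def may_fall_back(err: AuthError, explicit_pin: bool) -> bool:
    """Decide whether the fallback chain may be tried after a primary failure."""
    if explicit_pin:
        # The user pinned a model or provider on the CLI: never downgrade silently.
        return False
    if is_rate_limited_auth_error(err):
        # Falling through would spend other providers' quota; let the
        # existing rate-limit handling requeue the task instead.
        return False
    return True
```

When the guard returns False, the caller re-raises the primary error
unchanged, which matches the "primary error verbatim" test cases.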

Implementation
- `hermes_cli/oneshot.py:33-110` — new helper (testable at module level).
- `hermes_cli/oneshot.py:439-460` — call site updated; reads chain once,
  detects explicit_pin from CLI args, passes remaining_chain to AIAgent.
- AIAgent receives the correctly-sliced chain via `fallback_model=_fb`,
  preserving existing runtime-fallback semantics for mid-conversation
  failures.
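
One way the call site could derive `explicit_pin` from CLI args, sketched
with argparse. Whether the real CLI uses argparse is an assumption, and only
the two flags named in this message are modeled; `build_parser` and
`detect_explicit_pin` are hypothetical names.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Model only the --model and --provider flags named in this message."""
    parser = argparse.ArgumentParser(prog="hermes")
    parser.add_argument("--model", default="")
    parser.add_argument("--provider", default="")
    return parser


def detect_explicit_pin(args: argparse.Namespace) -> bool:
    """True when model OR provider was passed as a non-empty CLI argument.

    Environment variables and config values are deliberately ignored here.
    """
    return bool(args.model) or bool(args.provider)


pinned = detect_explicit_pin(build_parser().parse_args(["--model", "X"]))
unpinned = detect_explicit_pin(build_parser().parse_args([]))
```

Because the flag is computed from the parsed arguments alone, a model set
only in a profile or environment variable still allows fallback.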

Tests (9/9 passing) — tests/cli/test_oneshot_runtime_fallback.py
- primary succeeds → no fallback attempted, full chain preserved for AIAgent
- primary fails, first fallback succeeds → effective_model advances,
  remaining_chain sliced correctly
- two failures → third succeeds, slicing correct
- all fail → LAST AuthError propagates (not primary's)
- empty chain → primary error verbatim
- fallback without model → effective_model preserved
- explicit_pin=True → no fallback, primary error verbatim
- rate-limit AuthError → no fallback, primary error verbatim
- same provider in chain → no infinite loop, advances to next entry

Code-review pre-merge: reviewer caught silent-downgrade regression, stale
chain handoff, and quota-amplification footgun. All three addressed.

Follow-up (separate issues, not blocking)
- Consider applying the same pattern to `gateway/run.py:_resolve_runtime_agent_kwargs`
  and `cli.py:4881-4914` for a consistent worker-startup contract across surfaces.
- Optional: emit a metric/heartbeat counter when fallback fires so that a
  primary which is constantly failing behind a silent fallback can be detected.
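
A rough sketch of the optional counter, assuming an in-process
`collections.Counter` and a standard-library logger. The names
`fallback_fired` and `record_fallback` are hypothetical, and nothing here is
wired into an existing metrics system.

```python
import logging
from collections import Counter

log = logging.getLogger("oneshot.fallback")

# Counts how often startup landed on a fallback, keyed by (primary, landed).
fallback_fired: Counter = Counter()


def record_fallback(primary: str, landed: str) -> int:
    """Record one fallback event and return the running total for the pair."""
    fallback_fired[(primary, landed)] += 1
    total = fallback_fired[(primary, landed)]
    log.warning("startup fallback: %s -> %s (count=%d)", primary, landed, total)
    return total
```

A steadily rising count for one primary would indicate that its credentials
are failing on every startup even though workers keep succeeding.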

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jarvis-stark-ops jarvis-stark-ops merged commit ae8707b into main Jun 8, 2026
@jarvis-stark-ops jarvis-stark-ops deleted the wt/xai-auth-tolerance branch June 8, 2026 01:01

Development

Successfully merging this pull request may close these issues.

Dispatcher: detect xAI OAuth crashes (and similar transient auth failures) and skip provider instead of auto-blocking task
