fix(oneshot): honor fallback_providers chain during worker startup, not just runtime by jarvis-stark-ops · Pull Request #9 · 1Team-Engineering/hermes-agent

jarvis-stark-ops · 2026-06-08T01:00:52Z

Summary

Closes #6 (Dispatcher: detect xAI OAuth crashes (and similar transient auth failures) and skip provider instead of auto-blocking task).
Worker startup resolve_runtime_provider AuthError now iterates the configured fallback_providers chain instead of crashing the worker.
Three safety bounds preserved (explicit CLI pin, rate-limit, remaining-chain handoff) per pre-merge code review.
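The startup behaviour summarised above can be sketched roughly as follows. This is a minimal illustration, not the merged helper: the name `resolve_with_fallback_sketch`, the `resolve` callable, the `"provider/model"` string form of chain entries, the use of `None` for the landed index when the primary succeeds, and the stand-in `AuthError` and `is_rate_limited_auth_error` definitions are all assumptions made for the example.

```python
from typing import Callable, List, Optional, Sequence, Tuple


class AuthError(Exception):
    """Stand-in for the project's AuthError; the real class is not shown here."""


def is_rate_limited_auth_error(exc: Exception) -> bool:
    # Stand-in for the check in auth.py; this string heuristic is an assumption.
    return "rate limit" in str(exc).lower()


def resolve_with_fallback_sketch(
    resolve: Callable[[str], dict],   # hypothetical: provider name -> runtime dict
    provider: str,
    model: str,
    chain: Sequence[str],             # assumed "provider/model" or bare "provider"
    explicit_pin: bool = False,
) -> Tuple[dict, str, Optional[int], List[str]]:
    """Return (runtime, effective_model, landed_idx, remaining_chain)."""
    try:
        # Primary succeeded: no fallback, the full chain stays available.
        return resolve(provider), model, None, list(chain)
    except AuthError as exc:
        # Explicit pin, rate limit, or empty chain: surface the primary error.
        if explicit_pin or is_rate_limited_auth_error(exc) or not chain:
            raise
        last_error = exc

    for idx, entry in enumerate(chain):
        fb_provider, _, fb_model = entry.partition("/")
        try:
            runtime = resolve(fb_provider)
        except AuthError as exc:
            last_error = exc          # remember the most recent failure
            continue
        # Hand over only the entries after the one that worked.
        return runtime, fb_model or model, idx, list(chain[idx + 1:])

    raise last_error                  # every entry failed: re-raise the last error
```

With a `resolve` that fails for `xai-oauth` only, calling the sketch with the chain `["openai-codex/gpt-5.5", "xai-oauth/grok-4.3"]` lands on index 0 and leaves `["xai-oauth/grok-4.3"]` for the runtime loop; with `explicit_pin=True` the primary error propagates unchanged.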

Why this matters

v6.6 incident (2026-06-07): xAI OAuth token went missing mid-session and every subsequent worker crashed at startup, despite each profile having fallback_providers: [openai-codex/gpt-5.5, xai-oauth/grok-4.3] configured. The fallback chain was only honored AFTER successful credential resolution (AIAgent's runtime loop), not DURING initial worker startup. This fixes that.

Combined with #7 (now merged — dispatcher heartbeat) and #5 (still open — 429 exit-code mapping), the worker-startup → dispatcher-detect → next-retry loop is now operationally robust.

Test plan (9/9 passing)

Code-review focus

Safety bound: explicit CLI pin — hermes -z --model X --provider Y should NOT silently downgrade. New explicit_pin: bool parameter, derived from CLI args (not env vars, not config).
Safety bound: rate-limit on primary — re-uses existing is_rate_limited_auth_error() from auth.py.
Safety bound: remaining-chain handoff — helper now returns (runtime, model, landed_idx, remaining_chain); caller passes remaining to AIAgent so its runtime loop doesn't re-try the dead primary.
Hoisted get_fallback_chain(cfg) into a single local variable; it was previously called twice (TOCTOU on mutable cfg).
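The call-site points in this list (pin derived from CLI args only, chain read once) could look something like the sketch below. It is an illustration under assumptions: the parser models only `--model` and `--provider`, `read_chain_once` merely stands in for the project's `get_fallback_chain(cfg)`, and `plan_startup` and the dict-shaped `cfg` are hypothetical names and shapes, not the code in `hermes_cli/oneshot.py`.

```python
import argparse
from typing import List, Sequence


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser: only the two flags relevant to pinning are modelled.
    parser = argparse.ArgumentParser(prog="hermes")
    parser.add_argument("--model", default="")
    parser.add_argument("--provider", default="")
    return parser


def derive_explicit_pin(args: argparse.Namespace) -> bool:
    # A pin counts only when it came from the command line; environment
    # variables and config defaults are deliberately not consulted.
    return bool(args.model) or bool(args.provider)


def read_chain_once(cfg: dict) -> List[str]:
    # Stand-in for get_fallback_chain(cfg): copy the list so later mutation
    # of cfg cannot change what startup and the agent both see.
    return list(cfg.get("fallback_providers", []))


def plan_startup(argv: Sequence[str], cfg: dict) -> dict:
    # Unknown flags (such as -z) are ignored by this sketch.
    args, _unknown = build_parser().parse_known_args(list(argv))
    chain = read_chain_once(cfg)      # single read, reused by every consumer
    return {"explicit_pin": derive_explicit_pin(args), "chain": chain}
```

Taking one snapshot of the chain means the startup helper and the agent's runtime loop cannot disagree about its contents, which is the TOCTOU concern noted above.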

Follow-up (separate issues, not blocking)

Consider applying the same pattern to gateway/run.py:_resolve_runtime_agent_kwargs and cli.py:4881-4914 for a consistent worker-startup contract across all surfaces (CLI / gateway-spawned / oneshot).
Optional: emit a metric counter when fallback fires so that a primary that always fails silently becomes detectable.

🤖 Generated with Claude Code

…ot just runtime

Closes #6.

Problem

Worker startup calls `resolve_runtime_provider` to acquire credentials for the primary provider. If that raises AuthError (xAI OAuth token expired, Anthropic logged out, Codex revoked), the worker crashes before AIAgent's runtime fallback loop ever gets a chance, even though the user has explicitly configured a fallback chain for exactly this case.

Observed in the v6.6 incident on 2026-06-07: the xAI OAuth token went missing mid-session and every subsequent worker crashed at startup despite having `fallback_providers: [openai-codex/gpt-5.5, xai-oauth/grok-4.3]` configured.

Solution

New helper `_resolve_runtime_with_fallback` wraps the primary-resolution call. On AuthError, it iterates the configured fallback chain (read once from `get_fallback_chain(cfg)`) until one entry succeeds. If all fail, it re-raises the LAST AuthError so cli.py's exit handling can surface it.

Three safety bounds preserved (informed by code review):

1. **Explicit CLI pin**: `hermes -z --model X --provider Y ...` should NOT silently downgrade. When `model` OR `provider` was a non-empty CLI arg, the helper re-raises the primary AuthError verbatim, with no fallback attempt.
2. **Rate-limit AuthError on primary**: falling through to other providers would burn their quota in milliseconds (the "quota amplification" footgun). Detected via the existing `is_rate_limited_auth_error()`; the helper re-raises immediately, and the existing rate-limit handling (cli.py exit 75) gets the task requeued.
3. **Remaining-chain handoff to AIAgent**: when fallback lands on chain entry [N], AIAgent's runtime fallback loop should only see entries AFTER N (not the dead primary, not the entry we just used). The helper now returns `(runtime, effective_model, landed_at_index, remaining_chain)` and the caller passes `remaining` to AIAgent's `fallback_model`.

Implementation

- `hermes_cli/oneshot.py:33-110`: new helper (testable at module level).
- `hermes_cli/oneshot.py:439-460`: call site updated; reads the chain once, detects explicit_pin from CLI args, passes remaining_chain to AIAgent.
- AIAgent receives the correctly sliced chain via `fallback_model=_fb`, preserving existing runtime-fallback semantics for mid-conversation failures.

Tests (9/9 passing): tests/cli/test_oneshot_runtime_fallback.py

- primary succeeds → no fallback attempted, full chain preserved for AIAgent
- primary fails, first fallback succeeds → effective_model advances, remaining_chain sliced correctly
- two failures → third succeeds, slicing correct
- all fail → LAST AuthError propagates (not primary's)
- empty chain → primary error verbatim
- fallback without model → effective_model preserved
- explicit_pin=True → no fallback, primary error verbatim
- rate-limit AuthError → no fallback, primary error verbatim
- same provider in chain → no infinite loop, advances to next entry

Pre-merge code review: the reviewer caught a silent-downgrade regression, a stale chain handoff, and the quota-amplification footgun. All three are addressed.

Follow-up (separate issues, not blocking)

- Consider applying the same pattern to `gateway/run.py:_resolve_runtime_agent_kwargs` and `cli.py:4881-4914` for a consistent worker-startup contract across surfaces.
- Optional: emit a metric/heartbeat counter when fallback fires so that a constantly failing primary does not go unnoticed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

jarvis-stark-ops merged commit ae8707b into main Jun 8, 2026

jarvis-stark-ops deleted the wt/xai-auth-tolerance branch June 8, 2026 01:01

jarvis-stark-ops merged 1 commit into main from wt/xai-auth-tolerance
