Dispatcher: detect xAI OAuth crashes (and similar transient auth failures) and skip provider instead of auto-blocking task #6

@jarvis-stark-ops

Problem

When the configured xAI OAuth token expires or becomes invalid mid-session, the dispatcher's auth probe crashes with RuntimeError: xAI OAuth state is missing access_token. The gateway's kanban dispatch loop then stops cycling, but the gateway process stays alive (other platform-retry threads keep running). To external observers the gateway looks healthy; in reality no kanban work is being dispatched.

Evidence

Gateway logs 2026-06-07:

RuntimeError: xAI OAuth state is missing access_token. Re-authenticate with `hermes model`.
WARNING gateway.run: Shutdown context: signal=SIGTERM ...
[dispatcher cycles stop logging from 13:24:34 onward]

After this point:

  • 3 onboarding-test step-07 reviewers (tony, tchalla, vision) sat ready for 1h+ without being claimed
  • Gateway PID still alive
  • No kanban dispatcher [default]: spawned=X ... lines after 13:24:34

Recovery currently requires a manual hermes gateway run --replace.

Why this matters

Silent stalls are the worst failure mode for an autonomous system. Every provider's auth expires eventually, because OAuth tokens rotate, so one expired token should not kill the entire dispatch loop.

Combined with #5 (429 → delayed retry), this completes the 'no single provider failure mode strands the team' story.

Acceptance criteria

  • xAI OAuth probe failure is caught; the offending provider is marked unhealthy
  • Dispatcher continues running, only skipping that provider until re-auth
  • Visible WARN log: provider xai-oauth marked unhealthy: <reason>; team will use fallback chain until re-auth
  • Dispatcher health metric exposed (next-cycle time, providers-skipped) so silent stalls are detectable
  • Test: kill xAI auth state mid-session → assert dispatcher keeps cycling other providers
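
One possible shape for the criteria above, as a minimal sketch rather than the actual dispatcher code. The names Dispatcher, DispatcherHealth, run_cycle, and the probe callables are hypothetical and chosen only for illustration; the real dispatcher's structure is not shown in this issue. The sketch assumes each provider exposes a probe that raises on auth failure, and that probes are re-run every cycle so a provider recovers automatically after re-auth.

```python
import logging
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

log = logging.getLogger("dispatcher")


@dataclass
class DispatcherHealth:
    """Hypothetical health snapshot (field names are illustrative)."""
    next_cycle_at: float = 0.0
    # provider name -> reason it is currently being skipped
    providers_skipped: Dict[str, str] = field(default_factory=dict)


class Dispatcher:
    """Hypothetical dispatcher: one probe callable per provider."""

    def __init__(self, probes: Dict[str, Callable[[], None]], interval_s: float = 30.0):
        self.probes = probes
        self.interval_s = interval_s
        self.health = DispatcherHealth()

    def run_cycle(self, now: Optional[float] = None) -> List[str]:
        """Probe every provider and return the names usable this cycle."""
        now = time.time() if now is None else now
        usable: List[str] = []
        for name, probe in self.probes.items():
            try:
                probe()
            except Exception as exc:  # broad on purpose: a probe must not kill the loop
                if name not in self.health.providers_skipped:
                    # Log once on the healthy -> unhealthy transition.
                    log.warning(
                        "provider %s marked unhealthy: %s; "
                        "team will use fallback chain until re-auth",
                        name, exc,
                    )
                self.health.providers_skipped[name] = str(exc)
                continue
            # Probe succeeded: clear any earlier unhealthy mark.
            self.health.providers_skipped.pop(name, None)
            usable.append(name)
        self.health.next_cycle_at = now + self.interval_s
        return usable


def demo_auth_loss() -> List[List[str]]:
    """Simulate losing xAI auth state mid-session, then re-authenticating."""
    state = {"access_token": "tok"}

    def xai_probe() -> None:
        if "access_token" not in state:
            raise RuntimeError("xAI OAuth state is missing access_token.")

    def fallback_probe() -> None:
        return None  # stand-in for a provider that stays healthy

    d = Dispatcher({"xai-oauth": xai_probe, "fallback": fallback_probe})
    results = [d.run_cycle(now=100.0)]
    del state["access_token"]          # auth state disappears mid-session
    results.append(d.run_cycle(now=130.0))
    state["access_token"] = "new-tok"  # operator re-authenticates
    results.append(d.run_cycle(now=160.0))
    return results
```

The key design choice in this sketch is that the probe failure is recorded per provider instead of propagating out of the loop, and that next_cycle_at is updated on every cycle, so an external monitor can flag a stall when that timestamp falls behind the wall clock.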
