Problem
When the configured xAI OAuth token expires or becomes invalid mid-session, the dispatcher's auth probe crashes with `RuntimeError: xAI OAuth state is missing access_token`. The gateway's kanban dispatch loop stops cycling, but the gateway process stays alive (other platform-retry threads keep going). To external observers the gateway looks healthy; in reality no kanban work is being dispatched.
Evidence
Gateway logs 2026-06-07:
RuntimeError: xAI OAuth state is missing access_token. Re-authenticate with `hermes model`.
WARNING gateway.run: Shutdown context: signal=SIGTERM ...
[dispatcher cycles stop logging from 13:24:34 onward]
After this point:
- 3 onboarding-test step-07 reviewers (tony, tchalla, vision) sat ready for 1h+ without claim
- Gateway PID still alive
- No `kanban dispatcher [default]: spawned=X ...` lines after 13:24:34
Only a manual `hermes gateway run --replace` recovers the dispatcher.
Why this matters
Silent stalls are the worst failure mode for an autonomous system. Every provider's auth expires eventually (OAuth tokens rotate), so an expired token should never kill the entire dispatch loop.
Combined with #5 (429 → delayed retry), this completes the 'no single provider failure mode strands the team' story.
Acceptance criteria
- xAI OAuth probe failure is caught; the offending provider is marked unhealthy
- Dispatcher continues running, only skipping that provider until re-auth
- Visible WARN log:
provider xai-oauth marked unhealthy: <reason>; team will use fallback chain until re-auth
- Dispatcher health metric exposed (next-cycle time, providers-skipped) so silent stalls are detectable
- Test: kill xAI auth state mid-session → assert dispatcher keeps cycling other providers
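A minimal sketch of the behaviour these criteria describe, not the actual hermes dispatcher. `Dispatcher`, `DispatcherHealth`, `run_cycle`, and `mark_reauthenticated` are hypothetical names introduced for illustration, and the probe callables stand in for whatever the real auth probe does. The assumption is that each provider's probe raises on bad auth and that the loop can isolate that failure per provider:

```python
import logging
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

log = logging.getLogger("kanban.dispatcher")


@dataclass
class DispatcherHealth:
    """Hypothetical health snapshot an external monitor could poll."""
    last_cycle_at: float = 0.0
    next_cycle_at: float = 0.0
    providers_skipped: List[str] = field(default_factory=list)


class Dispatcher:
    """Illustrative dispatcher: one provider's probe failure never stops the loop."""

    def __init__(self, probes: Dict[str, Callable[[], None]], interval_s: float = 30.0):
        self.probes = probes                 # provider name -> auth probe (raises on failure)
        self.interval_s = interval_s
        self.unhealthy: Dict[str, str] = {}  # provider name -> failure reason
        self.health = DispatcherHealth()

    def mark_reauthenticated(self, name: str) -> None:
        """Clear the unhealthy flag once the operator has re-authenticated."""
        self.unhealthy.pop(name, None)

    def run_cycle(self, now: Optional[float] = None) -> List[str]:
        """Probe providers and return the ones usable this cycle."""
        now = time.time() if now is None else now
        usable: List[str] = []
        for name, probe in self.probes.items():
            if name in self.unhealthy:
                continue  # skipped until re-auth
            try:
                probe()
            except Exception as exc:  # broad on purpose: isolate any probe failure
                self.unhealthy[name] = str(exc)
                log.warning(
                    "provider %s marked unhealthy: %s; "
                    "team will use fallback chain until re-auth",
                    name, exc,
                )
                continue
            usable.append(name)
        # Publish health every cycle so a stalled loop shows up as a stale timestamp.
        self.health = DispatcherHealth(
            last_cycle_at=now,
            next_cycle_at=now + self.interval_s,
            providers_skipped=sorted(self.unhealthy),
        )
        return usable
```

With this shape, a monitor can treat the dispatcher as stalled whenever the current time is well past `next_cycle_at`, which is the signal missing from the incident above. The regression test in the last criterion then reduces to swapping in a probe that raises the `RuntimeError` and asserting that `run_cycle` still returns the remaining providers.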
Related