Credential pool: token_invalidated / token_revoked should be terminal failures, not 1-hour cooldowns

## Summary

When an OpenAI-Codex OAuth credential is revoked or invalidated upstream, Hermes marks it `exhausted` with a 1-hour TTL cooldown. After the TTL expires, the broken credential re-enters the rotation pool and fails again — usually on context compression in long sessions, where the failure surfaces as `Failed to generate context summary` and breaks the session's compaction.

Cooldown semantics are correct for transient errors (`429`, `5xx`, quota throttles). They are wrong for permanent OAuth states like `token_invalidated` and `token_revoked`.

## Repro

Observed on a fresh install on 2026-05-26: 7 separate `401 token_invalidated` failures from the same revoked Codex OAuth credential between 10:27 and 17:50 UTC:

```
Failed to generate context summary:
Error code: 401 - {'error': {'message': 'Your authentication token has been invalidated. Please try signing in again.',
                              'type': 'invalid_request_error',
                              'code': 'token_invalidated',
                              'param': None},
                   'status': 401}
```

Removing the credential manually (`hermes auth` → option 2 → remove openai-codex #1) silenced the failures temporarily, but a fresh credential with the same label, `Hermes Agent Codex`, reappeared in the pool later. The cause is unconfirmed: possibly a separate re-auth flow, or the cooldown re-rotating a stale `auth.json` entry. This needs upstream confirmation.

## Root cause

In `hermes-agent` credential pool logic:

```python
EXHAUSTED_TTL_DEFAULT_SECONDS = 60 * 60
```

A `401 token_invalidated` from the model provider takes the same `exhausted` code path as a `429 rate_limit` — both get a 1-hour TTL, after which the credential is re-considered eligible. This means a permanently-revoked OAuth token will keep getting picked back up.
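The re-entry described above can be illustrated with a minimal sketch. Only the `EXHAUSTED_TTL_DEFAULT_SECONDS` constant comes from the snippet above; the `is_eligible` helper and its parameters are hypothetical and only approximate how a TTL-based check behaves, not how `hermes-agent` actually implements it.

```python
EXHAUSTED_TTL_DEFAULT_SECONDS = 60 * 60  # constant quoted from the pool logic above


def is_eligible(exhausted_at, now, ttl=EXHAUSTED_TTL_DEFAULT_SECONDS):
    """Hypothetical eligibility check based purely on elapsed cooldown time.

    exhausted_at: timestamp (seconds) when the credential was marked
    exhausted, or None if it never failed.
    """
    if exhausted_at is None:
        return True
    # The check has no notion of *why* the credential failed, so a revoked
    # token and a rate-limited token become eligible at the same moment.
    return now - exhausted_at >= ttl


revoked_marked_at = 0.0
print(is_eligible(revoked_marked_at, now=30 * 60))  # still cooling down
print(is_eligible(revoked_marked_at, now=61 * 60))  # back in rotation
```

Because the failure reason is not part of the check, the revoked credential is picked again after the hour and fails with the same 401.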

## Expected behavior

`token_invalidated` and `token_revoked` should transition the credential to a terminal `dead` state that never re-enters rotation until the operator explicitly re-adds or refreshes the credential. Other 401 codes (e.g. `token_expired`, if Hermes can refresh) should keep cooldown semantics, but `token_invalidated` and `token_revoked` cannot be auto-recovered.
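One way to express this split is a small classifier over the provider's error payload. The sketch below assumes the payload has the shape shown in the Repro section (`{'error': {'code': ...}, 'status': 401}`); the function name `classify_failure` and the `"terminal"` / `"cooldown"` return values are illustrative, not existing Hermes APIs.

```python
TERMINAL_OAUTH_CODES = frozenset({"token_invalidated", "token_revoked"})


def classify_failure(status, payload):
    """Return 'terminal' for permanent OAuth failures, else 'cooldown'.

    Hypothetical helper: `payload` is the decoded error body, assumed to
    look like the 401 response quoted in the Repro section.
    """
    error = payload.get("error") if isinstance(payload, dict) else None
    code = error.get("code") if isinstance(error, dict) else None
    if status == 401 and code in TERMINAL_OAUTH_CODES:
        return "terminal"
    # 429, 5xx, network errors, and other 401 codes keep cooldown semantics.
    return "cooldown"


body = {
    "error": {
        "message": "Your authentication token has been invalidated. "
                   "Please try signing in again.",
        "type": "invalid_request_error",
        "code": "token_invalidated",
        "param": None,
    },
    "status": 401,
}
print(classify_failure(body["status"], body))
```

Matching on the explicit `code` rather than on the 401 status alone keeps refreshable cases such as `token_expired` on the cooldown path.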

## Suggested fix

Extend the credential pool state machine to include a `dead` state alongside `exhausted`:

- 429 / 503 / network errors → `exhausted` with TTL cooldown (current behavior, correct)
- 401 with `code == 'token_invalidated'` or `code == 'token_revoked'` → `dead`, no auto-recovery
- Successful OAuth refresh on a `dead` credential → transition back to `ok`
- `dead` credentials excluded from `pick_for_provider()` unconditionally
- `hermes auth` UI surfaces `dead` separately so the operator can see why it's offline

This mirrors the pattern already implemented in some open-source job-queue libraries (e.g., Sidekiq's `dead` set vs. `retry` set).
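The transitions listed above could be sketched roughly as follows. Apart from `pick_for_provider` (named in this issue; its real signature is not known here) and the TTL constant, every name is hypothetical, including the `Credential` fields and the `record_*` helpers. This is an illustration of the proposed state machine, not the existing `hermes-agent` code.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional

EXHAUSTED_TTL_DEFAULT_SECONDS = 60 * 60
TERMINAL_OAUTH_CODES = frozenset({"token_invalidated", "token_revoked"})


class State(Enum):
    OK = "ok"
    EXHAUSTED = "exhausted"
    DEAD = "dead"


@dataclass
class Credential:  # hypothetical shape of a pool entry
    label: str
    provider: str
    state: State = State.OK
    exhausted_at: Optional[float] = None
    last_error_reason: Optional[str] = None


def record_failure(cred: Credential, status: int, code: Optional[str], now: float) -> None:
    """Move a credential to DEAD for terminal OAuth codes, else EXHAUSTED."""
    cred.last_error_reason = code
    if status == 401 and code in TERMINAL_OAUTH_CODES:
        cred.state = State.DEAD
        cred.exhausted_at = None  # no TTL: never auto-recovers
    else:
        cred.state = State.EXHAUSTED
        cred.exhausted_at = now


def record_refresh_success(cred: Credential) -> None:
    """A successful OAuth refresh is the only way out of DEAD."""
    cred.state = State.OK
    cred.exhausted_at = None
    cred.last_error_reason = None


def pick_for_provider(pool: List[Credential], provider: str, now: float) -> Optional[Credential]:
    """Return the first usable credential; DEAD entries are always skipped."""
    for cred in pool:
        if cred.provider != provider or cred.state is State.DEAD:
            continue
        if cred.state is State.EXHAUSTED:
            if now - cred.exhausted_at < EXHAUSTED_TTL_DEFAULT_SECONDS:
                continue
            cred.state = State.OK  # cooldown elapsed
            cred.exhausted_at = None
        return cred
    return None
```

Keeping `last_error_reason` on the entry is what would let the `hermes auth` UI show why a `dead` credential is offline.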

## Workaround for affected users

Manual: `hermes auth` → option 2 → remove the openai-codex credential that shows `last_status: exhausted` and `last_error_reason: token_invalidated`. Do NOT switch `auxiliary.compression.provider` to a different model provider as a workaround if that provider is billed per token (e.g., Anthropic API); at agentic-run query volume, the bill grows quickly.

## Environment

- Hermes config: `model.provider = openai-codex`, `model.default = gpt-5.5`, `auxiliary.compression.provider = auto`
- Pool: 3 openai-codex credentials, 1 revoked, 2 healthy
- Failure surfaces in `~/.hermes/logs/errors.log` as `Failed to generate context summary: 401 token_invalidated`
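To check whether an install is affected, a rough count of matching log lines may help. The sketch below only assumes the log path given above and that each failure writes the string `token_invalidated` somewhere in the file; the exact log format is not known, so the count is approximate (a multi-line entry could match more than once). Both function names are illustrative.

```python
from pathlib import Path


def count_invalidated(lines):
    """Count lines mentioning token_invalidated (format-agnostic substring match)."""
    return sum(1 for line in lines if "token_invalidated" in line)


def count_in_log(path=None):
    """Scan the Hermes error log; returns 0 if the file does not exist."""
    if path is None:
        path = Path.home() / ".hermes" / "logs" / "errors.log"
    path = Path(path)
    if not path.exists():
        return 0
    with path.open(encoding="utf-8", errors="replace") as fh:
        return count_invalidated(fh)


if __name__ == "__main__":
    print(count_in_log())
```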

