[Bug]: OAuth token refresh fails on headless gateway — wrong endpoint + no recovery, causing persistent 401s with Anthropic Max

## Summary

When running Hermes with Anthropic Max (OAuth `sk-ant-oat` tokens) on a headless server via the gateway, the token expires after ~8 hours and auto-refresh fails silently, causing persistent 401 errors with no recovery path. The gateway continues serving error responses to all connected chat platforms until manually restarted with a fresh token.

This does not happen with other OAuth-based agent frameworks (e.g., OpenClaw) using the same Anthropic Max account and token structure.

## Environment

- Hermes Agent (installed via git, latest main branch as of Mar 25, 2026)
- Provider: `anthropic` (native Messages API)
- Auth: Anthropic Max subscription via OAuth token (`sk-ant-oat...`)
- Platform: Ubuntu 22.04 headless server, gateway mode (Telegram)
- No browser available on server

## Steps to Reproduce

1. Set up Hermes gateway on a headless server with `model.provider: anthropic`
2. Authenticate with Anthropic Max (OAuth token in `ANTHROPIC_API_KEY` or `CLAUDE_CODE_OAUTH_TOKEN` in `~/.hermes/.env`)
3. Run `claude auth login` to populate `~/.claude/.credentials.json` with a refresh token (this requires manual browser interaction, which is already painful on a headless server)
4. Wait ~8 hours for the access token to expire
5. Send a message to any connected platform

## Actual Behavior

- Gateway returns 401 errors: `{'type': 'authentication_error', 'message': 'invalid x-api-key'}`
- Error response sent to user for every message: "Sorry, I encountered an error (AuthenticationError)... Check your API key or run `claude /login`"
- `_refresh_oauth_token()` in `agent/anthropic_adapter.py` fails because:
  - It uses `https://console.anthropic.com/v1/oauth/token` with `application/x-www-form-urlencoded` (line 231)
  - The actual working endpoint is `https://platform.claude.com/v1/oauth/token` with `application/json` content type (confirmed by reading Claude Code's own `cli.js` source)
  - The refresh call returns HTTP 500, so no auto-refresh ever succeeds
- Gateway continues failing on every subsequent message with no self-healing
- Requires full manual re-authentication (browser-based OAuth flow) + gateway restart

## Expected Behavior

- Token refresh should work automatically using the correct endpoint and format
- Gateway should detect persistent auth failures and attempt credential refresh proactively
- On a headless server, the OAuth flow should be manageable without requiring repeated browser interaction

## Root Cause Analysis

### 1. Wrong token endpoint for refresh (bug)

`_refresh_oauth_token()` in `agent/anthropic_adapter.py` line 231:
```python
req = urllib.request.Request(
    "https://console.anthropic.com/v1/oauth/token",  # WRONG
    data=data,
    headers={
        "Content-Type": "application/x-www-form-urlencoded",  # WRONG for auth code exchange
    },
)
```

Claude Code's own source (`cli.js`) uses:
```javascript
TOKEN_URL: "https://platform.claude.com/v1/oauth/token"
// Uses application/json for token exchange
```

The refresh token grant may work with form-urlencoded on `console.anthropic.com` for some token types, but returns HTTP 500 for tokens obtained via the Claude Code OAuth flow.
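A minimal sketch of what a corrected refresh request might look like, assuming the endpoint and JSON content type reported above. `build_refresh_request` is a hypothetical helper name, and the body fields (`grant_type`, `refresh_token`, `client_id`) follow the standard OAuth 2.0 refresh grant; they are assumptions, not confirmed against Hermes or Claude Code. The function only builds the request and does not send it.

```python
import json
import urllib.request

# Endpoint reported in this issue as the one Claude Code uses.
TOKEN_URL = "https://platform.claude.com/v1/oauth/token"


def build_refresh_request(refresh_token: str, client_id: str) -> urllib.request.Request:
    """Build (but do not send) a JSON-encoded refresh-token request."""
    body = {
        "grant_type": "refresh_token",  # standard OAuth 2.0 grant name
        "refresh_token": refresh_token,
        "client_id": client_id,  # assumption: the server expects a client id
    }
    return urllib.request.Request(
        TOKEN_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

The caller would pass the result to `urllib.request.urlopen(...)` with a timeout and handle `urllib.error.HTTPError` for the 500 case described above.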

### 2. No startup validation (design gap)

When `ANTHROPIC_API_KEY` is an OAuth token (`sk-ant-oat...`) with no `~/.claude/.credentials.json` present, the gateway starts normally and only fails hours later when the token expires. There should be a startup warning: "OAuth token detected without refresh credentials — token will expire and cannot be auto-renewed."
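The check described above could be sketched roughly as follows. `startup_oauth_warning` is a hypothetical function name, and the credentials layout (`claudeAiOauth.refreshToken`) is an assumption about the file written by `claude auth login`; it should be verified against a real file. The message paraphrases the warning proposed above.

```python
import json
from pathlib import Path
from typing import Optional

OAUTH_PREFIX = "sk-ant-oat"  # OAuth token prefix named in this issue


def startup_oauth_warning(token: str, credentials_path: Path) -> Optional[str]:
    """Return a warning if an OAuth token has no refresh credentials, else None."""
    if not token.startswith(OAUTH_PREFIX):
        return None  # regular API key: nothing to refresh
    try:
        creds = json.loads(credentials_path.read_text())
    except (OSError, ValueError):
        creds = {}  # missing, unreadable, or malformed file
    # Assumed layout: {"claudeAiOauth": {"refreshToken": "..."}}
    oauth = creds.get("claudeAiOauth") if isinstance(creds, dict) else None
    if isinstance(oauth, dict) and oauth.get("refreshToken"):
        return None
    return ("OAuth token detected without refresh credentials: "
            "token will expire and cannot be auto-renewed.")
```

At gateway start, a non-`None` result would be logged at warning level, for example with `credentials_path=Path.home() / ".claude" / ".credentials.json"`.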

### 3. Gateway doesn't recover from persistent 401s (design gap)

When auth fails, the gateway logs the error and sends an error message to the user, but does none of the following:
- Re-read credentials from disk
- Attempt token refresh
- Alert the operator via a different channel
- Enter a degraded mode that retries periodically

Each subsequent message hits the same expired token and fails identically.
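One possible shape for the missing recovery step is a refresh-once-then-retry wrapper, sketched below. `AuthError`, `call_with_auth_recovery`, `send`, and `refresh` are all hypothetical names; in Hermes the exception would be whatever the provider client raises for a 401, and `refresh` would re-read credentials from disk and attempt a token refresh.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class AuthError(Exception):
    """Stand-in for the provider client's 401 error (hypothetical)."""


def call_with_auth_recovery(send: Callable[[], T], refresh: Callable[[], bool]) -> T:
    """Run send(); on an auth failure, refresh credentials once and retry."""
    try:
        return send()
    except AuthError:
        if not refresh():
            raise  # no usable credentials: surface the original 401
        return send()  # a second failure propagates to the caller
```

Retrying only once keeps a broken refresh path from looping; alerting the operator or a periodic degraded-mode retry would sit outside this wrapper.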

### 4. `claude auth login` is impractical on headless servers

The only way to get a refresh token is `claude auth login`, which tries to open a browser. On a headless server, this requires a manual workaround: generate a PKCE challenge, open the URL on another machine, and paste the code back. The token exchange itself requires knowing the correct endpoint (`platform.claude.com`, not `console.anthropic.com`) and format (`application/json` with a `state` parameter).
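For reference, the PKCE part of that workaround can be sketched with the standard library, following the S256 method from RFC 7636. The helper names are hypothetical, and the authorization URL and its other query parameters are deliberately left out because they are not confirmed here.

```python
import base64
import hashlib
import secrets
from typing import Tuple


def _b64url(raw: bytes) -> str:
    """Base64url-encode without padding, as RFC 7636 requires."""
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def challenge_for(verifier: str) -> str:
    """Derive the S256 code challenge from a code verifier."""
    return _b64url(hashlib.sha256(verifier.encode("ascii")).digest())


def make_pkce_pair() -> Tuple[str, str]:
    """Return a fresh (code_verifier, code_challenge) pair."""
    verifier = _b64url(secrets.token_bytes(32))  # 43 characters
    return verifier, challenge_for(verifier)
```

The challenge goes into the authorization URL opened on the other machine; the verifier stays on the server and is sent with the pasted code during the token exchange.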

## Suggested Fixes

1. **Fix the refresh token endpoint** — Update `_refresh_oauth_token()` to use `https://platform.claude.com/v1/oauth/token` and try both `application/json` and `application/x-www-form-urlencoded` content types for resilience.

2. **Startup credential validation** — At gateway start, if the resolved token is an OAuth token, check for a valid refresh token source. Warn loudly if none exists.

3. **Gateway auth failure recovery** — On 401 errors, attempt to refresh credentials from `~/.claude/.credentials.json` before returning the error to the user. If multiple consecutive 401s occur, log a prominent warning.

4. **Headless OAuth support** — Provide a `hermes auth login --headless` command that handles the PKCE flow end-to-end, printing the URL and accepting the code via stdin, without requiring `claude` CLI's browser-opening behavior.
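The "try both content types" idea in fix 1 could look something like the sketch below. `refresh_with_fallback` and the `post` callable are hypothetical; `post` stands in for whatever sends the body to the token endpoint and returns the parsed response, or `None` on failure. Trying JSON first is an assumption based on the analysis above.

```python
import json
import urllib.parse
from typing import Callable, Dict, Optional

# (content_type, body) -> parsed token response, or None if the attempt failed
Post = Callable[[str, bytes], Optional[dict]]


def refresh_with_fallback(fields: Dict[str, str], post: Post) -> Optional[dict]:
    """Try a JSON-encoded refresh first, then fall back to form encoding."""
    attempts = (
        ("application/json", json.dumps(fields).encode("utf-8")),
        ("application/x-www-form-urlencoded",
         urllib.parse.urlencode(fields).encode("ascii")),
    )
    for content_type, body in attempts:
        result = post(content_type, body)
        if result is not None:
            return result
    return None  # both encodings failed; caller should log and surface the error
```

Injecting `post` keeps the fallback logic testable without network access.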

## Logs

```
2026-03-25 03:47:47,853 ERROR root: Non-retryable client error: Error code: 401 - {'type': 'error', 'error': {'type': 'authentication_error', 'message': 'invalid x-api-key'}}
2026-03-25 03:48:14,166 ERROR root: Non-retryable client error: Error code: 401 - {'type': 'error', 'error': {'type': 'authentication_error', 'message': 'invalid x-api-key'}}
2026-03-25 03:48:57,657 ERROR root: Non-retryable client error: Error code: 401 - {'type': 'error', 'error': {'type': 'authentication_error', 'message': 'invalid x-api-key'}}
2026-03-25 13:05:49,431 ERROR root: Non-retryable client error: Error code: 401 - {'type': 'error', 'error': {'type': 'authentication_error', 'message': 'Invalid authentication credentials'}}
2026-03-25 13:20:50,585 ERROR root: Non-retryable client error: Error code: 401
2026-03-25 13:35:51,743 ERROR root: Non-retryable client error: Error code: 401
2026-03-25 13:50:52,931 ERROR root: Non-retryable client error: Error code: 401
2026-03-25 13:58:20,929 ERROR root: Non-retryable client error: Error code: 401
```

## Related

- #2374 (MiniMax auth overridden by Anthropic token refresh — same `resolve_anthropic_token()` code path)
- #1739 (Alibaba auth 401 from same root cause)

