Summary
_is_entitlement_failure in run_agent.py over-matches on xAI Grok 403 responses, causing legitimate "OAuth access token failed validation" errors to be misclassified as unsubscribed-account entitlement failures. The defensive guard against entitlement refresh loops (existing test references issue #26847) suppresses the refresh-on-401 path for both real cases, leaving long-running TUI sessions stuck on a stale token with no recovery.
Workaround: exit and reopen the TUI — the startup refresh path bypasses the broken classifier.
Repro
- Open a Hermes TUI session against provider/xai-oauth (SuperGrok).
- Let it sit idle long enough that the access token goes stale by xAI's server-side criteria (in my case, ~22 hours; can happen sooner if xAI rotates session-side).
- Send a request.
- xAI returns HTTP 403 with this body:
  {
    "code": "The caller does not have permission to execute the specified operation",
    "error": "The OAuth2 access token could not be validated. [WKE=unauthenticated:bad-credentials]"
  }
- Hermes logs "Non-retryable client error" and surfaces it to the user. No refresh attempt happens, even though the credential pool's _refresh_entry for this provider works fine (proven by opening a new TUI session: the startup-resolve path refreshes successfully).
Expected
The [WKE=unauthenticated:bad-credentials] suffix unambiguously indicates this is a credential-validation failure, not an entitlement failure. Hermes should:
- Call _recover_with_credential_pool → try_refresh_current() → _swap_credential
- Retry the request with the refreshed token
- Either succeed (the typical case after a stale token) or, if the refresh itself fails terminally, fall through to the existing terminal-quarantine path
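The expected flow above can be sketched roughly as follows. This is a minimal illustration, not Hermes code: FakeCredentialPool, fake_xai and send_with_one_refresh are hypothetical stand-ins, and only the try_refresh_current() name and the 403 body are taken from this report.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

BAD_CREDENTIALS_BODY = {
    "code": "The caller does not have permission to execute the specified operation",
    "error": "The OAuth2 access token could not be validated. [WKE=unauthenticated:bad-credentials]",
}


@dataclass
class FakeCredentialPool:
    """Hypothetical stand-in for the credential pool; it only tracks refresh calls."""
    refresh_succeeds: bool = True
    token: str = "stale-token"
    refresh_calls: int = 0

    def try_refresh_current(self) -> bool:
        self.refresh_calls += 1
        if self.refresh_succeeds:
            self.token = "fresh-token"
        return self.refresh_succeeds


def fake_xai(token: str) -> Tuple[int, dict]:
    """Hypothetical backend: accepts only the refreshed token."""
    if token == "fresh-token":
        return 200, {"ok": True}
    return 403, dict(BAD_CREDENTIALS_BODY)


def send_with_one_refresh(
    pool: FakeCredentialPool,
    send: Callable[[str], Tuple[int, dict]],
) -> Tuple[int, dict]:
    """Send once; on a bad-credentials 403, refresh at most once and retry."""
    status, body = send(pool.token)
    if status == 403 and "[WKE=unauthenticated:" in str(body.get("error", "")):
        if pool.try_refresh_current():
            status, body = send(pool.token)
    # If the refresh failed, the original 403 falls through to the caller,
    # where the existing terminal handling would take over.
    return status, body
```

With a pool whose refresh succeeds, the retry returns 200 after exactly one refresh call; with a pool whose refresh fails, the 403 is returned unchanged and no second refresh is attempted.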
Actual
_is_entitlement_failure returns True because the response body matches its substring heuristic on "caller does not have permission". The recovery short-circuits, returns False, error surfaces as non-retryable.
Root cause
xAI's API returns the same code field text for two distinct conditions:
| Condition | code (same) | error field (the disambiguator) |
| --- | --- | --- |
| Entitlement (account isn't SuperGrok-subscribed) | "The caller does not have permission to execute the specified operation" | "... active Grok subscription. Manage at https://grok.com" (or similar entitlement language) |
| Bad credentials (access token failed validation) | "The caller does not have permission to execute the specified operation" | "The OAuth2 access token could not be validated. [WKE=unauthenticated:bad-credentials]" |
The existing tests in tests/run_agent/test_codex_xai_oauth_recovery.py cover the entitlement case correctly (test_is_entitlement_failure_matches_real_xai_bodies), but there's no test case for the bad-credentials variant — so the classifier treats both identically.
The [WKE=unauthenticated:bad-credentials] suffix is xAI's authoritative disambiguator. Hermes currently ignores it.
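To illustrate the over-match, here is a small stand-alone sketch. naive_is_entitlement_failure is a hypothetical approximation of a substring heuristic on "caller does not have permission"; the actual body of _is_entitlement_failure is not reproduced in this report, so treat this as an assumption about its shape rather than the real code.

```python
ENTITLEMENT_BODY = {
    "code": "The caller does not have permission to execute the specified operation",
    # Elided exactly as in the table above; the full entitlement text varies.
    "error": "... active Grok subscription. Manage at https://grok.com",
}
BAD_CREDENTIALS_BODY = {
    "code": "The caller does not have permission to execute the specified operation",
    "error": "The OAuth2 access token could not be validated. [WKE=unauthenticated:bad-credentials]",
}


def naive_is_entitlement_failure(body: dict, status_code: int) -> bool:
    """Hypothetical heuristic: substring match over the whole body, WKE suffix ignored."""
    if status_code != 403:
        return False
    haystack = " ".join(str(value) for value in body.values()).lower()
    return "caller does not have permission" in haystack


# Both bodies share the same code text, so the heuristic cannot tell them apart.
print(naive_is_entitlement_failure(ENTITLEMENT_BODY, 403))
print(naive_is_entitlement_failure(BAD_CREDENTIALS_BODY, 403))
```

Both calls return True, which is the misclassification described above: only the error field differs, and a heuristic that keys on the shared code text never looks at it.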
Proposed fixes (escalating, pick one)
1. Tightest: In _is_entitlement_failure, check the body's error field first: if it contains [WKE=unauthenticated: (or specifically [WKE=unauthenticated:bad-credentials]), return False immediately. The refresh path then handles it.
2. Pragmatic: Require BOTH the entitlement keyword AND the absence of "OAuth2 access token could not be validated" before classifying as entitlement.
3. Safest: When the WKE suffix says unauthenticated, attempt refresh-once before classifying. The existing loop-protection still kicks in on the second 403 if refresh didn't actually help.
Fix #1 is mechanical and matches the explicit disambiguator xAI sends. Recommended.
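A rough sketch of what the tightest fix might look like, written as a stand-alone function. The name is_entitlement_failure_fix1, the constants, and the exact matching on the entitlement side are assumptions for illustration; the real method lives on AIAgent and may inspect the body differently.

```python
from typing import Any

# Prefix of the suffix xAI appends to credential-validation failures.
WKE_UNAUTHENTICATED_MARKER = "[WKE=unauthenticated:"
# Text shared by both 403 variants; assumed to be what the heuristic keys on.
ENTITLEMENT_MARKER = "caller does not have permission"


def is_entitlement_failure_fix1(body: Any, status_code: int) -> bool:
    """Hypothetical classifier: check the WKE marker before the entitlement text."""
    if status_code != 403 or not isinstance(body, dict):
        return False
    error_text = str(body.get("error") or "")
    # The disambiguator wins: an unauthenticated WKE suffix means the token
    # failed validation, so this is never an entitlement failure.
    if WKE_UNAUTHENTICATED_MARKER in error_text:
        return False
    haystack = f"{body.get('code') or ''} {error_text}".lower()
    return ENTITLEMENT_MARKER in haystack
```

Checking the marker first keeps the existing entitlement behavior untouched for bodies without a WKE suffix, so the loop protection referenced by issue #26847 should be unaffected.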
Test additions
Suggested cases for tests/run_agent/test_codex_xai_oauth_recovery.py:
def test_is_entitlement_failure_false_for_bad_credentials_wke_suffix():
    """403 with WKE=unauthenticated:bad-credentials is auth failure, not entitlement."""
    from run_agent import AIAgent

    assert not AIAgent._is_entitlement_failure(
        {
            "code": "The caller does not have permission to execute the specified operation",
            "error": "The OAuth2 access token could not be validated. [WKE=unauthenticated:bad-credentials]",
        },
        403,
    )


def test_recover_with_credential_pool_refreshes_on_xai_bad_credentials_403():
    """A bad-credentials 403 from xai-oauth must trigger refresh."""
    # Same scaffolding as test_recover_with_credential_pool_still_refreshes_genuine_auth_failure,
    # but with status_code=403 and the bad-credentials error body. Should call try_refresh_current().
Impact
- Any long-running TUI / chat session against provider/xai-oauth will eventually 403 once the token goes stale, and the user has to exit/reopen to recover.
- Bridge adapters (Discord, Telegram, etc.) appear unaffected in practice because their process lifecycle / proactive refresh cadence keeps tokens fresh enough that the reactive-recovery path is rarely exercised. But they're vulnerable to the same bug under the right timing.
- Reproduced on two independent installations of Hermes against two separate SuperGrok-active xAI OAuth accounts — same exact symptom, same exact 403 body.
Environment
- Hermes — recent v0.14.x snapshot (cloned source, current main)
- Python 3.11.15 on Linux
- provider/xai-oauth, source manual:xai_pkce (not loopback_pkce, but the bug is upstream of the loopback-vs-manual distinction)
- xAI Grok backend, grok-4.3 model, https://api.x.ai/v1