platform retry: rotate auth profiles within a single retry sequence before failing the request

## Summary

Hermes's Python platform layer has its own API-call retry/fallback logic, separate from the openclaw runtime's model-fallback chain. Today this retry logic appears to rotate between configured ChatGPT/Codex auth profiles within a single user-request cycle, but the rotation looks incomplete or only partially applied. Filing this so the in-retry profile rotation contract is explicit, observable, and tested.

There is a parallel upstream issue against openclaw for the same conceptual gap in the openclaw runtime path: https://github.com/openclaw/openclaw/issues/79604. The two layers have different code paths but the same operator-visible failure mode.

## Environment

- `hermes-agent` running production gateway (`/root/.hermes/hermes-agent/venv/bin/python -m hermes_cli.main gateway run --replace`)
- Three OAuth profiles configured for `openai-codex`:
  - 1× ChatGPT Pro account (`prolite` plan_type)
  - 2× ChatGPT Team account profiles (`team` plan_type)
- `credential_pool_strategies.openai-codex: fill_first`
- Fallback chain into openclaw-runtime: `openai-codex/gpt-5.5` → `claude-cli/claude-opus-4-7` → `openrouter/...`

## Observable behavior — partial rotation

Today at 18:41:08 EDT, the Python platform's retry logic emitted four 429s in rapid succession carrying two different `plan_type` values (one `prolite`, then three `team`), with `attempt 1/4` logged twice:

```
18:41:08 python[2844586]: ⚠️ API call failed (attempt 1/4): RateLimitError [HTTP 429]
   📋 Details: {'type':'usage_limit_reached','plan_type':'prolite','resets_at':1778538925}
18:41:08 python[2844586]: ⚠️ API call failed (attempt 1/4): RateLimitError [HTTP 429]
   📋 Details: {'type':'usage_limit_reached','plan_type':'team','resets_at':1778295076}
18:41:08 python[2844586]: ⚠️ API call failed (attempt 2/4): RateLimitError [HTTP 429]
   📋 Details: {'type':'usage_limit_reached','plan_type':'team','resets_at':1778295076}
18:41:08 python[2844586]: ⚠️ API call failed (attempt 3/4): RateLimitError [HTTP 429]
   📋 Details: {'type':'usage_limit_reached','plan_type':'team','resets_at':1778295076}
```

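Two distinct accounts (`prolite` and `team`) appeared in this single retry burst, which shows that the Python layer does rotate profiles at least some of the time. Since both Team profiles carry the `team` plan_type, these lines alone cannot show whether the three `team` failures hit one Team profile or both. However: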

- All three Hermes codex profiles (1 prolite + 2 team) have `last_status_at` timestamps in `/root/.hermes/auth.json` indicating they were each touched independently, but the rotation pattern between them inside a single retry cycle is not consistent across runs.
- Other runs in today's logs show only one `plan_type` cycling through 4 retries (no rotation; only retrying the same already-cooled profile).
- The retry counter advances `attempt 1/4 → 2/4 → 3/4`, and rotation is not capped separately from the retry budget. A per-profile "tried once" counter would be cleaner than reusing the retry budget.
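
To tell "rotated" bursts from "same profile retried" bursts across a day of logs, a small script can count the `plan_type` values per burst. This is a minimal sketch that assumes the log format shown in the excerpt above (a Python dict repr after `Details:`); `plan_types_in_burst` is a hypothetical helper name, not part of Hermes.

```python
import ast
import re
from collections import Counter

# Matches the dict printed after "Details:" in the log excerpt above.
DETAILS_RE = re.compile(r"Details:\s*(\{.*\})")


def plan_types_in_burst(log_text: str) -> Counter:
    """Count the plan_type of every 429 'Details' line in one retry burst."""
    counts = Counter()
    for raw in DETAILS_RE.findall(log_text):
        # The log prints a Python dict repr (single quotes), not JSON.
        details = ast.literal_eval(raw)
        counts[details.get("plan_type", "unknown")] += 1
    return counts


SAMPLE = """\
18:41:08 python[2844586]: API call failed (attempt 1/4): RateLimitError [HTTP 429]
   Details: {'type':'usage_limit_reached','plan_type':'prolite','resets_at':1778538925}
18:41:08 python[2844586]: API call failed (attempt 1/4): RateLimitError [HTTP 429]
   Details: {'type':'usage_limit_reached','plan_type':'team','resets_at':1778295076}
18:41:08 python[2844586]: API call failed (attempt 2/4): RateLimitError [HTTP 429]
   Details: {'type':'usage_limit_reached','plan_type':'team','resets_at':1778295076}
18:41:08 python[2844586]: API call failed (attempt 3/4): RateLimitError [HTTP 429]
   Details: {'type':'usage_limit_reached','plan_type':'team','resets_at':1778295076}
"""

counts = plan_types_in_burst(SAMPLE)
print(dict(counts))      # {'prolite': 1, 'team': 3}
print(len(counts) > 1)   # True: more than one plan_type appeared in the burst
```

More than one key in the result means at least two plan types were tried. Because the two Team profiles share a `plan_type`, this check cannot separate them; that would need a per-profile identifier in the log line.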

## Operator-visible symptom

When the python retry exhausts without rotating cleanly through all profiles, the request bails out and the openclaw-runtime fallback chain is consulted. That fallback (claude-cli, openrouter) has its own latency and context-loss tax. The operator sees a slower or context-degraded reply when a healthy profile of the same provider was actually available.

## Suggested behavior

Within a single user-request retry sequence, when an `openai-codex` profile returns `usage_limit_reached` or `auth_invalid`:

1. Mark that profile in cooldown (the existing logic appears to do this).
2. Re-resolve the active profile via fill_first selection, **excluding** the just-cooled profile.
3. Re-run the API call against the new profile.
4. Cap rotations at `len(available_profiles)` (or a hard `MAX_PROFILE_ROTATIONS`, e.g. 3).
5. Only after exhausting all profiles, surface to the openclaw-runtime fallback chain.

The retry budget (e.g. `attempt N/4`) should be **per profile**, not shared across profiles — otherwise rotating profiles burns retries.
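
The steps above can be sketched as a small loop. This is an illustration under stated assumptions, not Hermes's actual code: `ApiError`, `ProfilePool`, `AllProfilesExhausted`, and `call_with_rotation` are hypothetical names, and re-raising once the transient budget is spent is one possible choice that the proposal does not specify.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Set

PROFILE_LEVEL_ERRORS = {"usage_limit_reached", "auth_invalid"}


class ApiError(Exception):
    """Hypothetical error carrying the provider's error 'type' field."""

    def __init__(self, error_type: str):
        super().__init__(error_type)
        self.error_type = error_type


class AllProfilesExhausted(Exception):
    """Raised when every profile is in cooldown; the caller then falls back."""


@dataclass
class ProfilePool:
    """Hypothetical pool holding profiles in fill_first priority order."""

    profiles: List[str]
    cooled: Set[str] = field(default_factory=set)

    def select(self) -> Optional[str]:
        # fill_first: the first profile that is not in cooldown wins.
        return next((p for p in self.profiles if p not in self.cooled), None)


def call_with_rotation(pool: ProfilePool, call: Callable[[str], str],
                       retries_per_profile: int = 4) -> str:
    for _ in range(len(pool.profiles)):           # step 4: cap rotations
        profile = pool.select()                   # step 2: exclude cooled profiles
        if profile is None:
            break
        for attempt in range(1, retries_per_profile + 1):
            try:
                return call(profile)              # step 3: run against this profile
            except ApiError as err:
                if err.error_type in PROFILE_LEVEL_ERRORS:
                    pool.cooled.add(profile)      # step 1: mark cooldown
                    break                         # rotate; budget resets for next profile
                if attempt == retries_per_profile:
                    raise                         # assumption: transient budget spent, no rotation
    # step 5: only now hand over to the openclaw-runtime fallback chain
    raise AllProfilesExhausted("all profiles are in cooldown")
```

The inner loop restarts at attempt 1 for each profile, so rotating never consumes retries that belong to the next profile.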

## Suggested observability

Emit a structured log line for each profile rotation within a retry cycle:

```
{"event":"profile_rotation","provider":"openai-codex",
 "from_profile":"<sha>","to_profile":"<sha>",
 "reason":"rate_limit","attempt":2,"max":4,
 "remaining_profiles":1}
```

This gives operators a way to distinguish "rotated to a healthy profile" from "rotated to another exhausted profile" from "no rotation happened at all".
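
A minimal sketch of emitting that line with the standard library follows. The logger name, the function names, and the choice to fingerprint the profile identifier with a truncated SHA-256 are assumptions; the issue only specifies `<sha>` placeholders and the field names.

```python
import hashlib
import json
import logging

# Hypothetical logger name; Hermes may use a different one.
logger = logging.getLogger("hermes.platform.retry")


def profile_fingerprint(profile_id: str) -> str:
    """Assumption: identify a profile by a short hash so no account data is logged."""
    return hashlib.sha256(profile_id.encode("utf-8")).hexdigest()[:12]


def build_rotation_event(from_profile: str, to_profile: str, reason: str,
                         attempt: int, max_attempts: int,
                         remaining_profiles: int,
                         provider: str = "openai-codex") -> dict:
    """Build the event with the field names proposed above."""
    return {
        "event": "profile_rotation",
        "provider": provider,
        "from_profile": profile_fingerprint(from_profile),
        "to_profile": profile_fingerprint(to_profile),
        "reason": reason,
        "attempt": attempt,
        "max": max_attempts,
        "remaining_profiles": remaining_profiles,
    }


def log_profile_rotation(**fields) -> dict:
    """Emit one compact JSON object per rotation and return it for testing."""
    event = build_rotation_event(**fields)
    logger.info(json.dumps(event, separators=(",", ":")))
    return event
```

Keeping the event builder separate from the logging call lets the retry tests assert on the returned dict without capturing log output.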

## Reproduction

1. Hermes gateway with 3 codex auth profiles, all configured.
2. Force profile 0 into `usage_limit_reached` cooldown (rate-limit it).
3. Send a request that triggers a Hermes platform-layer API call.
4. Observe whether the second retry attempt uses profile 1 or repeats profile 0.

In today's logs, both behaviors appear at different times, which suggests the rotation is non-deterministic or path-dependent.

## Suggested test coverage

In Hermes's API-call retry tests:

1. **rotate-then-succeed**: 3 profiles, profile 0 returns 429; assert next attempt uses profile 1 and succeeds. Assert `profile_rotation` event is emitted.
2. **rotate-cap-honored**: all profiles return 429; assert exactly N attempts (where N = profile count), no further retries against already-cooled profiles.
3. **per-profile-retry-budget**: profile 0 returns transient 5xx (NOT a profile-level error); assert retries against the SAME profile up to budget, no rotation. Differentiate between profile-level and transient failures.
4. **fill_first-still-works**: a separate request after profile 0 cooled down picks profile 1 cleanly via fill_first (regression check).
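
Test 3 depends on separating profile-level failures from transient ones. A minimal sketch of that classification follows, assuming the retry code can see the HTTP status and the provider's error `type`; `FailureKind` and `classify_failure` are hypothetical names, and anything outside the two cases named in this issue is left as `OTHER` for the caller to decide.

```python
from enum import Enum
from typing import Optional

PROFILE_LEVEL_TYPES = {"usage_limit_reached", "auth_invalid"}


class FailureKind(Enum):
    PROFILE_LEVEL = "profile_level"  # cool the profile down and rotate
    TRANSIENT = "transient"          # retry the same profile within its budget
    OTHER = "other"                  # not covered by this issue


def classify_failure(status: int, error_type: Optional[str]) -> FailureKind:
    """Decide whether a failed call should trigger rotation or a same-profile retry."""
    if error_type in PROFILE_LEVEL_TYPES:
        return FailureKind.PROFILE_LEVEL
    if 500 <= status < 600:
        return FailureKind.TRANSIENT
    return FailureKind.OTHER
```

The error `type` is checked before the status code, so a profile-level error is treated as such whatever status it arrives with.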

## Impact

- Reduces user-perceived latency when the preferred provider has multiple healthy profiles.
- Maximizes utilization of paid auth pools (Pro/Team plan profiles) before paying the cross-provider context-loss tax.
- Aligns Hermes's python platform retry path with the openclaw-runtime fallback contract (which is being extended to do the same; see openclaw/openclaw#79604).
- No-op for installations with only one profile per provider.

## Filed by

OpenClaw operator instance, with corroborating evidence from a paired Hermes deployment running `openclaw 2026.5.7` against three `openai-codex` OAuth profiles. Cross-references companion issue at https://github.com/openclaw/openclaw/issues/79604.
