platform retry: rotate auth profiles within a single retry sequence before failing the request #22212

@wherewolf87

Summary

Hermes's Python platform layer has its own API-call retry/fallback logic (separate from the openclaw runtime's model-fallback chain). Today this retry logic appears to rotate between configured ChatGPT/Codex auth profiles within a single user-request cycle, but the rotation looks incomplete or only partially applied. Filing this so the in-retry profile rotation contract is explicit, observable, and tested.

There is a parallel upstream issue against openclaw for the same conceptual gap in the openclaw runtime path: openclaw/openclaw#79604. The two layers have different code paths but the same operator-visible failure mode.

Environment

  • hermes-agent running production gateway (/root/.hermes/hermes-agent/venv/bin/python -m hermes_cli.main gateway run --replace)
  • Three OAuth profiles configured for openai-codex:
    • 1× ChatGPT Pro account (prolite plan_type)
    • 2× ChatGPT Team account profiles (team plan_type)
  • credential_pool_strategies.openai-codex: fill_first
  • Fallback chain into openclaw-runtime: openai-codex/gpt-5.5 → claude-cli/claude-opus-4-7 → openrouter/...

Observable behavior — partial rotation

Today at 18:41:08 EDT, the Python platform's retry logic emitted four 429s within the same second, covering two plan_type values (prolite once, then team three times):

18:41:08 python[2844586]: ⚠️ API call failed (attempt 1/4): RateLimitError [HTTP 429]
   📋 Details: {'type':'usage_limit_reached','plan_type':'prolite','resets_at':1778538925}
18:41:08 python[2844586]: ⚠️ API call failed (attempt 1/4): RateLimitError [HTTP 429]
   📋 Details: {'type':'usage_limit_reached','plan_type':'team','resets_at':1778295076}
18:41:08 python[2844586]: ⚠️ API call failed (attempt 2/4): RateLimitError [HTTP 429]
   📋 Details: {'type':'usage_limit_reached','plan_type':'team','resets_at':1778295076}
18:41:08 python[2844586]: ⚠️ API call failed (attempt 3/4): RateLimitError [HTTP 429]
   📋 Details: {'type':'usage_limit_reached','plan_type':'team','resets_at':1778295076}

Two distinct plan types (prolite and team) appear in this single retry burst, which shows the Python layer does rotate profiles at least once. Note that plan_type alone cannot tell the two team profiles apart. However:

  • All three Hermes codex profiles (1 prolite + 2 team) have last_status_at timestamps in /root/.hermes/auth.json indicating they were each touched independently, but the rotation pattern between them inside a single retry cycle is not consistent across runs.
  • Other runs in today's logs show only one plan_type cycling through 4 retries (no rotation; only retrying the same already-cooled profile).
  • The retry counter advances attempt 1/4 → 2/4 → 3/4 (with attempt 1/4 logged twice, once per plan_type), but rotation is not capped separately from the retry budget. A per-profile "tried once" counter would be cleaner than reusing the retry budget.
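To compare bursts across runs, the Details payloads can be tallied per plan_type. The following is a minimal sketch, assuming the log lines keep the shape shown above; the function name `plan_types_in_burst` and the regex are illustrative and not part of Hermes. Since plan_type does not distinguish the two team profiles, a count like this only gives a lower bound on how many profiles were touched.

```python
import ast
import re
from collections import Counter

# Matches the dict payload printed after "Details:" on each failure line.
DETAILS_RE = re.compile(r"Details:\s*(\{.*\})")


def plan_types_in_burst(log_text: str) -> Counter:
    """Count how often each plan_type appears in a block of retry log lines."""
    counts = Counter()
    for line in log_text.splitlines():
        match = DETAILS_RE.search(line)
        if not match:
            continue
        # The payload is a Python dict repr (single quotes), so parse it with
        # ast.literal_eval rather than json.loads.
        details = ast.literal_eval(match.group(1))
        counts[details.get("plan_type", "unknown")] += 1
    return counts


# The burst quoted in this issue.
SAMPLE = """\
18:41:08 python[2844586]: ⚠️ API call failed (attempt 1/4): RateLimitError [HTTP 429]
   📋 Details: {'type':'usage_limit_reached','plan_type':'prolite','resets_at':1778538925}
18:41:08 python[2844586]: ⚠️ API call failed (attempt 1/4): RateLimitError [HTTP 429]
   📋 Details: {'type':'usage_limit_reached','plan_type':'team','resets_at':1778295076}
18:41:08 python[2844586]: ⚠️ API call failed (attempt 2/4): RateLimitError [HTTP 429]
   📋 Details: {'type':'usage_limit_reached','plan_type':'team','resets_at':1778295076}
18:41:08 python[2844586]: ⚠️ API call failed (attempt 3/4): RateLimitError [HTTP 429]
   📋 Details: {'type':'usage_limit_reached','plan_type':'team','resets_at':1778295076}
"""

if __name__ == "__main__":
    print(plan_types_in_burst(SAMPLE))
```

A burst with a single key in the resulting counter corresponds to the "no rotation" runs described above.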

Operator-visible symptom

When the python retry exhausts without rotating cleanly through all profiles, the request bails out and the openclaw-runtime fallback chain is consulted. That fallback (claude-cli, openrouter) has its own latency and context-loss tax. The operator sees a slower or context-degraded reply when a healthy profile of the same provider was actually available.

Suggested behavior

Within a single user-request retry sequence, when an openai-codex profile returns usage_limit_reached or auth_invalid:

  1. Mark that profile in cooldown (the existing logic appears to do this).
  2. Re-resolve the active profile via fill_first selection, excluding the just-cooled profile.
  3. Re-run the API call against the new profile.
  4. Cap rotations at len(available_profiles) (or a hard MAX_PROFILE_ROTATIONS, e.g. 3).
  5. Only after exhausting all profiles, surface to the openclaw-runtime fallback chain.

The retry budget (e.g. attempt N/4) should be per profile, not shared across profiles — otherwise rotating profiles burns retries.
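The steps above could be sketched roughly as follows. This is an illustration under stated assumptions, not the Hermes implementation: `Profile`, `ApiError`, `AllProfilesExhausted`, `fill_first`, and `call_with_rotation` are hypothetical names, and the choice to re-raise (rather than rotate) once a profile's transient budget is spent is one possible reading of the proposal.

```python
from dataclasses import dataclass

# Error types that disqualify the profile itself (as listed in this issue).
PROFILE_LEVEL_ERRORS = {"usage_limit_reached", "auth_invalid"}


class ApiError(Exception):
    """Hypothetical error carrying the provider's error 'type' field."""

    def __init__(self, kind):
        super().__init__(kind)
        self.kind = kind


class AllProfilesExhausted(Exception):
    """Signals the caller to hand off to the runtime fallback chain."""


@dataclass
class Profile:
    name: str
    cooled: bool = False


def fill_first(profiles, exclude):
    """Return the first profile that is neither cooled nor already tried."""
    for profile in profiles:
        if not profile.cooled and profile.name not in exclude:
            return profile
    return None


def call_with_rotation(profiles, api_call, retries_per_profile=4):
    tried = set()
    for _ in range(len(profiles)):  # step 4: cap rotations at the profile count
        profile = fill_first(profiles, tried)  # step 2: re-resolve, excluding tried
        if profile is None:
            break
        tried.add(profile.name)
        # The retry budget restarts for every profile.
        for attempt in range(1, retries_per_profile + 1):
            try:
                return api_call(profile)  # step 3: run against this profile
            except ApiError as err:
                if err.kind in PROFILE_LEVEL_ERRORS:
                    profile.cooled = True  # step 1: mark in cooldown
                    break  # rotate to the next profile
                if attempt == retries_per_profile:
                    raise  # transient budget spent on this profile
    # step 5: only now surface to the fallback chain
    raise AllProfilesExhausted(f"tried {len(tried)} profile(s)")
```

Keeping the `tried` set separate from the cooldown flag means a profile is attempted at most once per request even if its cooldown state is reset concurrently by another request.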

Suggested observability

Emit a structured log line for each profile rotation within a retry cycle:

{"event":"profile_rotation","provider":"openai-codex",
 "from_profile":"<sha>","to_profile":"<sha>",
 "reason":"rate_limit","attempt":2,"max":4,
 "remaining_profiles":1}

This gives operators a way to distinguish "rotated to a healthy profile" from "rotated to another exhausted profile" from "no rotation happened at all".
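One way to produce such a line with only the standard library is sketched below. The logger name, the helper names, and the choice of a truncated SHA-256 digest for the `<sha>` placeholders are all assumptions made for illustration; the field names follow the example above.

```python
import hashlib
import json
import logging

logger = logging.getLogger("hermes.profile_rotation")  # hypothetical logger name


def profile_fingerprint(profile_id: str) -> str:
    """Short, stable digest so logs never carry the raw profile identifier."""
    return hashlib.sha256(profile_id.encode("utf-8")).hexdigest()[:12]


def emit_rotation_event(provider, from_profile, to_profile, reason,
                        attempt, max_attempts, remaining_profiles):
    """Log one profile_rotation event as a single JSON line and return it."""
    event = {
        "event": "profile_rotation",
        "provider": provider,
        "from_profile": profile_fingerprint(from_profile),
        "to_profile": profile_fingerprint(to_profile),
        "reason": reason,
        "attempt": attempt,
        "max": max_attempts,
        "remaining_profiles": remaining_profiles,
    }
    line = json.dumps(event, separators=(",", ":"))
    logger.info(line)
    return line


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO, format="%(message)s")
    emit_rotation_event("openai-codex", "profile-0", "profile-1",
                        "rate_limit", 2, 4, 1)
```

Emitting one JSON object per line keeps the events greppable and easy to load into whatever log pipeline the operator already runs.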

Reproduction

  1. Hermes gateway with 3 codex auth profiles, all configured.
  2. Force profile 0 into usage_limit_reached cooldown (rate-limit it).
  3. Send a request that triggers a Hermes platform-layer API call.
  4. Observe whether the second retry attempt uses profile 1 or repeats profile 0.

In today's logs, both behaviors appear at different times, suggesting the rotation is non-deterministic or path-dependent.

Suggested test coverage

In Hermes's API-call retry tests:

  1. rotate-then-succeed: 3 profiles, profile 0 returns 429; assert next attempt uses profile 1 and succeeds. Assert profile_rotation event is emitted.
  2. rotate-cap-honored: all profiles return 429; assert exactly N attempts (where N = profile count), no further retries against already-cooled profiles.
  3. per-profile-retry-budget: profile 0 returns transient 5xx (NOT a profile-level error); assert retries against the SAME profile up to budget, no rotation. Differentiate between profile-level and transient failures.
  4. fill_first-still-works: a separate request after profile 0 cooled down picks profile 1 cleanly via fill_first (regression check).
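The first three cases could be driven by a small scripted test double, sketched below. Everything here is hypothetical scaffolding: `ScriptedProvider`, the `RateLimited` and `Transient` exception classes, and `reference_run` (a stand-in that exists only so the checks can execute) are not Hermes names. In practice the `run` argument would be the real retry entry point, and case 1 would additionally assert that a profile_rotation event is emitted.

```python
class RateLimited(Exception):
    """Stand-in for a profile-level failure (HTTP 429 usage_limit_reached)."""


class Transient(Exception):
    """Stand-in for a transient failure (HTTP 5xx)."""


class ScriptedProvider:
    """Test double: replays scripted outcomes per profile and records calls."""

    def __init__(self, script):
        self.script = {name: list(outcomes) for name, outcomes in script.items()}
        self.calls = []

    def __call__(self, profile):
        self.calls.append(profile)
        outcome = self.script[profile].pop(0)
        if isinstance(outcome, Exception):
            raise outcome
        return outcome


def reference_run(profiles, provider, budget=4):
    """Minimal stand-in for the code under test; replace with the real entry point."""
    last_error = RuntimeError("no profiles configured")
    for profile in profiles:
        for _ in range(budget):
            try:
                return provider(profile)
            except RateLimited as err:
                last_error = err
                break  # profile-level failure: rotate
            except Transient as err:
                last_error = err  # transient failure: retry the same profile
        else:
            raise last_error  # transient budget spent: do not rotate
    raise last_error


PROFILES = ["p0", "p1", "p2"]


def check_rotate_then_succeed(run):
    provider = ScriptedProvider({"p0": [RateLimited()], "p1": ["ok"], "p2": []})
    assert run(PROFILES, provider) == "ok"
    assert provider.calls == ["p0", "p1"]


def check_rotate_cap_honored(run):
    provider = ScriptedProvider({p: [RateLimited()] for p in PROFILES})
    try:
        run(PROFILES, provider)
    except RateLimited:
        pass
    else:
        raise AssertionError("expected the request to fail")
    assert provider.calls == PROFILES  # exactly one attempt per profile


def check_per_profile_retry_budget(run):
    provider = ScriptedProvider(
        {"p0": [Transient(), Transient(), "ok"], "p1": [], "p2": []}
    )
    assert run(PROFILES, provider) == "ok"
    assert provider.calls == ["p0", "p0", "p0"]  # no rotation on transient errors


if __name__ == "__main__":
    for check in (check_rotate_then_succeed, check_rotate_cap_honored,
                  check_per_profile_retry_budget):
        check(reference_run)
```

Recording the call order in the double makes the difference between "rotated" and "retried the same profile" a direct list comparison instead of a log-scraping exercise.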

Impact

Filed by

OpenClaw operator instance, with corroborating evidence from a paired Hermes deployment running openclaw 2026.5.7 against three openai-codex OAuth profiles. Cross-references companion issue at openclaw/openclaw#79604.

Metadata

    Labels

    P2: Medium, degraded but workaround exists
    area/auth: Authentication, OAuth, credential pools
    comp/agent: Core agent loop, run_agent.py, prompt builder
    type/bug: Something isn't working
