[Bug]: blockedUntil for subscription_limit set far in the future never re-probes when no fallback is configured #90702

@brtkwr

Bug type

Behavior bug (incorrect output/state without crash)

Beta release blocker

No

Summary

On 2026.5.28, when an openai-codex auth profile hits its subscription cap and the upstream reports a "next reset in N days" timestamp, OpenClaw stores that timestamp verbatim in auth-state.json as blockedUntil. With fallbacks: [], the probe-during-cooldown path short-circuits on hasFallbackCandidates, so the profile is never re-probed and stays blocked for days, even after the rolling cap has recovered.

Steps to reproduce

  1. Install OpenClaw 2026.5.28 and @openclaw/codex@2026.5.28, configure with agents.defaults.model.primary: openai-codex/gpt-5.5 and fallbacks: [].
  2. Drive enough usage to exhaust the rolling weekly cap (in this case, an accidental heartbeat firing every 30 min for ~24 hours).
  3. Observe that the upstream returns: You've reached your Codex subscription usage limit. Next reset in 6 days, Jun 7 at 3:43 PM UTC.
  4. Check auth-state.json at agents/main/agent/auth-state.json:
    "openai-codex:<account>": {
      "blockedUntil": 1780846982712,
      "blockedReason": "subscription_limit",
      "blockedSource": "wham",
      "errorCount": 1,
      "failureCounts": { "rate_limit": 1 },
      "lastFailureAt": 1780401970719
    }
  5. Wait 3 days. Observe that every scheduled cron lane logs decision=skip_candidate ... Provider openai-codex is in cooldown (suspending lanes). No model calls are made.
  6. Run openclaw infer model run --prompt "say hello in one word" directly. It returns successfully, so the upstream API is callable. The block is purely stale OpenClaw-side state.
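
Decoding the two epoch-millisecond values from the snippet in step 4 shows that blockedUntil is simply the upstream's reported reset time. A quick check in Node.js, using only the values quoted above (the variable names are just for this example):

```javascript
// Values copied from the auth-state.json snippet in step 4.
const blockedUntil = 1780846982712;
const lastFailureAt = 1780401970719;

// Epoch milliseconds -> ISO 8601 (UTC).
const blockedUntilIso = new Date(blockedUntil).toISOString();
const lastFailureIso = new Date(lastFailureAt).toISOString();

// Length of the stored block, in days.
const blockDays = (blockedUntil - lastFailureAt) / 86_400_000;

console.log(blockedUntilIso);      // 2026-06-07T15:43:02.712Z
console.log(lastFailureIso);       // 2026-06-02T12:06:10.719Z
console.log(blockDays.toFixed(2)); // 5.15
```

The first value lines up with the "Jun 7 at 3:43 PM UTC" reset in the upstream message. The second matches the candidate_failed log entry at Jun 02 12:06:10, assuming the gateway logs are in UTC.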

Expected behavior

After the upstream's rolling cap recovers (which happens before the reported "next reset" since it's a rolling window, not a discrete reset), OpenClaw should re-probe the primary and resume serving calls. With no fallback configured, recovery probing should still happen, since "is the primary callable yet?" is a recovery question, not a fallback-switching question.

Actual behavior

The profile stays blocked until the stored blockedUntil time has literally passed, regardless of actual API state. In dist/model-fallback-DRgKirrj.js:

function shouldProbePrimaryDuringCooldown(params) {
  if (!params.isPrimary || !params.hasFallbackCandidates) return false;
  // ...
}

The early return on !hasFallbackCandidates means that with fallbacks: [], no probe ever fires. Gateway logs confirm this: ~250 skip_candidate entries over 3 days and zero attempts at the actual upstream.
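
One way to read suggested fix 2 (under "Additional information") is to separate the two questions that this guard conflates. The sketch below is hypothetical and does not reproduce the elided body of the real function. Only isPrimary and hasFallbackCandidates come from the quoted source. The names shouldTryFallback, shouldReprobePrimary, lastProbeAt, now and PROBE_INTERVAL_MS are invented for illustration.

```javascript
// Hypothetical minimum spacing between recovery probes (assumed value).
const PROBE_INTERVAL_MS = 60 * 60 * 1000;

// "Should we try a fallback now?" legitimately needs fallback candidates.
function shouldTryFallback(params) {
  return params.isPrimary && params.hasFallbackCandidates;
}

// "Should we re-probe the primary now?" does not depend on fallbacks.
// It is throttled only by the time since the last probe (lastProbeAt is assumed).
function shouldReprobePrimary(params) {
  if (!params.isPrimary) return false;
  const last = params.lastProbeAt ?? 0;
  return params.now - last >= PROBE_INTERVAL_MS;
}

// With fallbacks: [], the fallback answer is "no" but a recovery probe still fires.
const noFallbacks = {
  isPrimary: true,
  hasFallbackCandidates: false,
  lastProbeAt: 0,
  now: 2 * PROBE_INTERVAL_MS,
};
console.log(shouldTryFallback(noFallbacks));    // false
console.log(shouldReprobePrimary(noFallbacks)); // true
```

Keeping the recovery check free of hasFallbackCandidates is the whole point of the split. The interval-based throttle is only one possible policy for spacing probes.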

OpenClaw version

2026.5.28

Operating system

Ubuntu 24.04

Install method

npm global

Model

openai-codex/gpt-5.5

Provider / routing chain

openclaw -> @openclaw/codex@2026.5.28 -> openai (ChatGPT Plus OAuth)

Additional provider/model setup details

  • Single auth profile: openai-codex:<account> (OAuth, ChatGPT Plus subscription)
  • agents.defaults.model.fallbacks: [] (no fallback configured)
  • compaction.maxActiveTranscriptBytes: "500kb", truncateAfterCompaction: true
  • auth.cooldowns: {} (defaults)

Logs, screenshots, and evidence

Jun 02 12:06:10 [model-fallback/decision] decision=candidate_failed
    requested=openai-codex/gpt-5.5 candidate=openai-codex/gpt-5.5
    reason=rate_limit next=none
    detail=You've reached your Codex subscription usage limit. Next reset in 6 days, Jun 7 at 3:43 PM UTC.
Jun 02 14:30:00 [model-fallback/decision] decision=skip_candidate
    requested=openai-codex/gpt-5.5 candidate=openai-codex/gpt-5.5
    reason=rate_limit next=none
    detail=Provider openai-codex is in cooldown (suspending lanes)
(repeats every scheduled cron tick for 3 days)

See also the auth-state.json snippet under "Steps to reproduce". A direct openclaw infer model run succeeded immediately after manually clearing blockedUntil.

Impact and severity

  • Affected: any single-host OpenClaw install with one upstream and fallbacks: [] that hits a subscription cap.
  • Severity: blocks workflow. Scheduled crons and channel replies stop posting for the entire duration of blockedUntil.
  • Frequency: triggered once per cap exhaustion, then sticks until manual intervention.
  • Consequence: agents go silent for days. In our case, 3 days of no replies to scheduled Telegram interactions and four daily cron jobs not firing.

Additional information

  • Related design discussion: Feature request: native Codex quota/auth diagnosis plus brokered reauth execution #54278 (proposes a quota_wait state separate from reauth_required). This bug is the concrete shape of one of the problems that #54278 describes.
  • Two suggested minimal fixes (either alone would have prevented this):
    1. Cap blockedUntil for subscription_limit reasons. Store min(reportedReset, now + MAX_SUBSCRIPTION_BLOCK_MS). With a cap of e.g. 1 hour, the profile gets re-probed an hour later; if still exhausted, the upstream returns the same error and the block is re-armed; if recovered, work resumes. Keep the reported timestamp in a separate expectedFullResetAt field for display only.
    2. Drop the hasFallbackCandidates short-circuit for recovery probes. Split shouldProbePrimaryDuringCooldown into "should we try a fallback now?" (legitimately needs fallback candidates) and "should we re-probe the primary now?" (doesn't). The recovery-probe branch should fire on any time-based throttle regardless of fallback configuration.
  • Workaround currently in place: hourly cron clearing blockedUntil for subscription_limit blocks where blockedUntil > now + 12h AND lastFailureAt < now - 6h.
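
Suggested fix 1 and the current workaround can both be sketched as small pure functions. Everything here is an assumption for illustration. MAX_SUBSCRIPTION_BLOCK_MS uses the 1 hour example floated above. armSubscriptionBlock and isStaleSubscriptionBlock are hypothetical names. expectedFullResetAt is the display-only field proposed in fix 1, not an existing one.

```javascript
const HOUR_MS = 60 * 60 * 1000;
const MAX_SUBSCRIPTION_BLOCK_MS = 1 * HOUR_MS; // example cap from fix 1

// Fix 1: block for at most the cap, and keep the reported reset for display only.
function armSubscriptionBlock(reportedReset, now) {
  return {
    blockedUntil: Math.min(reportedReset, now + MAX_SUBSCRIPTION_BLOCK_MS),
    blockedReason: "subscription_limit",
    expectedFullResetAt: reportedReset,
  };
}

// Workaround predicate: a subscription_limit block counts as stale when it
// ends more than 12h from now and the last failure is more than 6h old.
function isStaleSubscriptionBlock(state, now) {
  return (
    state.blockedReason === "subscription_limit" &&
    state.blockedUntil > now + 12 * HOUR_MS &&
    state.lastFailureAt < now - 6 * HOUR_MS
  );
}

// Values from the auth-state.json snippet in this report.
const failedAt = 1780401970719;
const reportedReset = 1780846982712;

const armed = armSubscriptionBlock(reportedReset, failedAt);
console.log(armed.blockedUntil - failedAt === HOUR_MS); // true

const stored = {
  blockedUntil: reportedReset,
  blockedReason: "subscription_limit",
  lastFailureAt: failedAt,
};
console.log(isStaleSubscriptionBlock(stored, failedAt + 7 * HOUR_MS)); // true
console.log(isStaleSubscriptionBlock(stored, failedAt + 1 * HOUR_MS)); // false
```

With the cap in place, a still-exhausted profile costs one failed upstream call per cap interval, and a recovered one resumes within that interval instead of waiting for the reported reset.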

Labels

  • P1: High-priority user-facing bug, regression, or broken workflow.
  • clawsweeper:fix-shape-clear: ClawSweeper found a clear likely implementation shape for this issue.
  • clawsweeper:queueable-fix: ClawSweeper marked this issue as an existing queue_fix_pr work candidate.
  • clawsweeper:source-repro: ClawSweeper found a high-confidence source-level issue reproduction.
  • impact:auth-provider: Auth, provider routing, model choice, or SecretRef resolution may break.
  • impact:message-loss: Channel message delivery can be lost, duplicated, or misrouted.
  • issue-rating: 🦞 diamond lobster: Very strong issue quality with high-confidence source-level or clear reproduction.
