Skip to content

fix(agents): retry same model across short rate-limit windows#91911

Merged
vincentkoc merged 5 commits into
openclaw:mainfrom
lanzhi-lee:codex/rate-limit-same-model-retry
Jun 11, 2026
Merged

fix(agents): retry same model across short rate-limit windows#91911
vincentkoc merged 5 commits into
openclaw:mainfrom
lanzhi-lee:codex/rate-limit-same-model-retry

Conversation

@lanzhi-lee

Copy link
Copy Markdown
Contributor

Summary

  • Retry assistant-stage short-window rate limits on the same provider/model/profile before spending profile rotation or model fallback budget.
  • Use a bounded deterministic retry budget of 10s, 20s, and 30s so common minute-scale RPM/TPM windows can clear without adding unbounded latency.
  • Keep quota, usage-limit, subscription, daily/monthly, and bare quota-exceeded 429 failures on the existing escalation path; allow quota wording only when it carries explicit RPM/TPM/per-minute hints.

Verification

  • node scripts/run-vitest.mjs src/agents/embedded-agent-runner/run/helpers.test.ts src/agents/embedded-agent-runner/run/assistant-failover.test.ts
  • node scripts/run-oxlint.mjs src/agents/embedded-agent-runner/run.ts src/agents/embedded-agent-runner/run/assistant-failover.ts src/agents/embedded-agent-runner/run/assistant-failover.test.ts src/agents/embedded-agent-runner/run/helpers.ts src/agents/embedded-agent-runner/run/helpers.test.ts
  • git diff --check upstream/main..HEAD
  • .agents/skills/autoreview/scripts/autoreview --mode branch --base upstream/main — clean before the final conflict-free rebase; no review rerun after the final upstream rebase per maintainer instruction.

@openclaw-barnacle openclaw-barnacle Bot added agents Agent runtime and tooling size: M triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. labels Jun 10, 2026
@clawsweeper

clawsweeper Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Codex review: needs real behavior proof before merge. Reviewed June 10, 2026, 10:30 AM ET / 14:30 UTC.

Summary
The PR adds three bounded same-provider/model retries with 10-, 20-, and 30-second delays for assistant-stage rate limits classified as short-window throttles, before existing auth-profile rotation or model fallback.

PR surface: Source +106, Tests +229. Total +335 across 5 files.

Reproducibility: yes. from source: pass a rate-limit assistant error containing Retry-After: 3600 through handleAssistantFailover; the branch classifies it as short-window and invokes the same-model retry path. Current main instead proceeds directly to profile/fallback handling.

Review metrics: none identified.

Merge readiness
Overall: 🧂 unranked krab
Proof: 🧂 unranked krab
Patch quality: 🦪 silver shellfish
Result: blocked until real behavior proof from a real setup is added.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Rank-up moves:

  • [P1] Fix long or ambiguous Retry-After classification and add its regression coverage.
  • Emit a distinct trace result for same-model rate-limit retries.
  • [P1] Add redacted real-provider proof for transient recovery and hard-quota escalation.

Proof guidance:

  • [P1] Needs real behavior proof before merge: Only focused tests, lint, diff checking, and autoreview are reported; before merge, add redacted terminal logs or live output showing a real short-window throttle recover on the same model and a hard or long-window quota response escalate without the added delay. Redact keys, account details, IP addresses, endpoints, and other private data. Updating the PR body should trigger a fresh ClawSweeper review; otherwise ask a maintainer to comment @clawsweeper re-review. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.

Risk before merge

  • [P2] Messages containing a long Retry-After interval can be misclassified as minute-scale throttles, delaying an otherwise usable profile or fallback model by the full 60-second retry budget.
  • [P1] The attempt trace claims a profile rotation for each same-model retry, which can corrupt operator diagnostics and any logic that interprets failover observations.
  • [P1] The policy changes provider and auth-profile routing across all agent runs without real-provider evidence showing both transient recovery and immediate escalation for hard quota failures.

Maintainer options:

  1. Tighten retry classification and tracing (recommended)
    Restrict retries to confirmed short windows, represent the outcome distinctly in traces, add focused coverage for long Retry-After values, and provide redacted live-provider proof.
  2. Accept heuristic delays intentionally
    Maintainers may accept message-based classification and up to 60 seconds of delayed fallback after reviewing real-provider evidence and the diagnostic trace implications.
  3. Preserve immediate failover
    Pause or close this PR if a cross-provider same-model delay should not become the default rate-limit policy.

Next step before merge

  • [P1] The contributor should repair the two concrete defects and provide real-provider proof; automation cannot supply evidence from the contributor’s provider setup.

Security
Cleared: The patch changes only agent failover logic and focused tests; it does not alter dependencies, secrets, permissions, downloads, build hooks, publishing, or other supply-chain surfaces.

Review findings

  • [P2] Do not treat every Retry-After value as a short window — src/agents/embedded-agent-runner/run/assistant-failover.ts:40
  • [P2] Record same-model retries separately from profile rotations — src/agents/embedded-agent-runner/run.ts:2922-2926
Review details

Best possible solution:

Retry the same model only when structured provider/runtime data or narrowly validated message text establishes a short throttle window, honor an available retry interval within a maintainer-approved bound, emit a distinct same-model-rate-limit observation, and retain immediate profile/model escalation for long-window or ambiguous quota failures.

Do we have a high-confidence way to reproduce the issue?

Yes, from source: pass a rate-limit assistant error containing Retry-After: 3600 through handleAssistantFailover; the branch classifies it as short-window and invokes the same-model retry path. Current main instead proceeds directly to profile/fallback handling.

Is this the best way to solve the issue?

No, not in its current form. Same-model retry is a reasonable repair direction, but generic retry-after matching and inaccurate failover tracing make this broader and less observable than the narrow maintainable solution.

Full review comments:

  • [P2] Do not treat every Retry-After value as a short window — src/agents/embedded-agent-runner/run/assistant-failover.ts:40
    retry-after is accepted as a short-window hint without parsing its duration. A provider error such as 429 quota exceeded; Retry-After: 3600 therefore waits and retries the same unavailable model three times before profile/model fallback. Parse and bound the interval, or require an explicit RPM/TPM/per-minute signal when no structured short interval is available.
    Confidence: 0.97
  • [P2] Record same-model retries separately from profile rotations — src/agents/embedded-agent-runner/run.ts:2922-2926
    The new retry kind falls through to result: "rotate_profile", although the path intentionally retains the same profile. This makes attempt traces report rotations that never occurred and obscures whether the new retry policy actually ran. Add a distinct trace result or otherwise preserve the real action.
    Confidence: 0.98

Overall correctness: patch is incorrect
Overall confidence: 0.97

AGENTS.md: found and applied where relevant.

Codex review notes: reasoning high; reviewed against 4ecec2f9e2f8.

Label changes

Label changes:

  • add rating: 🧂 unranked krab: Overall readiness is 🧂 unranked krab; proof is 🧂 unranked krab and patch quality is 🦪 silver shellfish.
  • add status: 📣 needs proof: The PR needs real behavior proof before ClawSweeper can clear the contributor ask. Needs real behavior proof before merge: Only focused tests, lint, diff checking, and autoreview are reported; before merge, add redacted terminal logs or live output showing a real short-window throttle recover on the same model and a hard or long-window quota response escalate without the added delay. Redact keys, account details, IP addresses, endpoints, and other private data. Updating the PR body should trigger a fresh ClawSweeper review; otherwise ask a maintainer to comment @clawsweeper re-review. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.
  • remove rating: 🌊 off-meta tidepool: Current PR rating is rating: 🧂 unranked krab, so this older rating label is no longer current.

Label justifications:

  • P2: This is a normal-priority agent reliability fix with bounded but cross-provider behavior and two concrete repairable defects.
  • merge-risk: 🚨 auth-provider: Merging changes when rate-limit failures consume the current profile versus rotating credentials or selecting another model/provider.
  • merge-risk: 🚨 availability: A misclassified rate-limit response can keep a run on an unavailable model for up to 60 seconds before the existing fallback path executes.
  • rating: 🧂 unranked krab: Overall readiness is 🧂 unranked krab; proof is 🧂 unranked krab and patch quality is 🦪 silver shellfish.
  • status: 📣 needs proof: The PR needs real behavior proof before ClawSweeper can clear the contributor ask. Needs real behavior proof before merge: Only focused tests, lint, diff checking, and autoreview are reported; before merge, add redacted terminal logs or live output showing a real short-window throttle recover on the same model and a hard or long-window quota response escalate without the added delay. Redact keys, account details, IP addresses, endpoints, and other private data. Updating the PR body should trigger a fresh ClawSweeper review; otherwise ask a maintainer to comment @clawsweeper re-review. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.
Evidence reviewed

PR surface:

Source +106, Tests +229. Total +335 across 5 files.

View PR surface stats
Area Files Added Removed Net
Source 3 107 1 +106
Tests 2 229 0 +229
Docs 0 0 0 0
Config 0 0 0 0
Generated 0 0 0 0
Other 0 0 0 0
Total 5 336 1 +335

What I checked:

Likely related people:

  • brokemac79: Commit de4b8d8 introduced or rewrote the current assistant-failover module and its ownership boundary immediately before this PR. (role: recent area contributor; confidence: high; commits: de4b8d8ebf73; files: src/agents/embedded-agent-runner/run/assistant-failover.ts, src/agents/embedded-agent-runner/run.ts)
  • Vincent Koc: The latest shipped baseline commit is the other available provenance point for the failover path in this shallow checkout, and the PR timeline also explicitly mentions vincentkoc for review context. (role: adjacent runtime and release contributor; confidence: medium; commits: 5181e4f7c82b; files: src/agents/embedded-agent-runner/run.ts, src/agents/embedded-agent-runner/run/assistant-failover.ts)
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

@clawsweeper clawsweeper Bot added rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. P2 Normal backlog priority with limited blast radius. merge-risk: 🚨 auth-provider 🚨 May break OAuth, tokens, provider routing, model choice, or credentials. merge-risk: 🚨 availability 🚨 May cause crashes, hangs, restart loops, stalls, or process outages. labels Jun 10, 2026
changquanyou and others added 4 commits June 10, 2026 19:17
Port the rate-limit same-model retry subset from internal commit b954516f29a.

Provider RPM caps are commonly minute-scale, so wait out the current
provider/model/profile window before spending a profile rotation or model
fallback. The retry budget is bounded and the existing profile/fallback path
runs once it is exhausted.

(cherry picked from commit b954516f29ac39ac178c7bf0e0818443e8f183ca)
Port the same-model rate-limit backoff tuning from internal commit 8590a90bde0.

Use three deterministic linear retries (10s, 20s, 30s) so the bounded retry
budget spans roughly one RPM window while giving early recovery a faster first
attempt.

(cherry picked from commit 8590a90bde0586ceb50dfa4a3b19f5055329e98d)
@lanzhi-lee lanzhi-lee force-pushed the codex/rate-limit-same-model-retry branch from 1908d7f to 0841b69 Compare June 10, 2026 11:23
@lanzhi-lee

Copy link
Copy Markdown
Contributor Author

@clawsweeper re-review

@clawsweeper

clawsweeper Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

@clawsweeper clawsweeper Bot added rating: 🌊 off-meta tidepool PR readiness rating does not apply to this item. rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. and removed rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. rating: 🌊 off-meta tidepool PR readiness rating does not apply to this item. labels Jun 10, 2026
@clawsweeper clawsweeper Bot added rating: 🌊 off-meta tidepool PR readiness rating does not apply to this item. rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. and removed status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. rating: 🌊 off-meta tidepool PR readiness rating does not apply to this item. labels Jun 10, 2026
@vincentkoc vincentkoc self-assigned this Jun 11, 2026
@vincentkoc

Copy link
Copy Markdown
Member

Maintainer fix pushed to this PR branch.

Summary:

  • Bounded same-model rate-limit retry classification to explicit short-window evidence: RPM/TPM/per-minute/frequency signals or parsed short Retry-After values.
  • Honored parsed Retry-After delays in the actual same-model sleep, capped to the existing short-window retry ceiling.
  • Preserved rateLimitedProfileRotations: 0 semantics and generic rate limit exceeded profile-rotation/fallback behavior.
  • Recorded same-model rate-limit retries as same_model_rate_limit and kept fallbackUsed false for pure same-model retries.

Verification:

  • node scripts/run-vitest.mjs src/agents/embedded-agent-runner/run/assistant-failover.test.ts src/agents/embedded-agent-runner/run/helpers.test.ts
  • Azure Crabbox cbx_bdb5a7807a1f / coral-shrimp: OPENCLAW_CHECK_CHANGED_REMOTE_CHILD=1 OPENCLAW_CHANGED_LANES_RAW_SYNC=1 corepack pnpm check:changed passed.
  • Autoreview: clean, no accepted/actionable findings.

GitHub reports this PR is mergeable and has no status checks attached to the current head.

@vincentkoc vincentkoc merged commit 9a6c71a into openclaw:main Jun 11, 2026
156 of 167 checks passed
wangmiao0668000666 pushed a commit to wangmiao0668000666/openclaw that referenced this pull request Jun 12, 2026
…aw#91911)

Bound same-model rate-limit retries to explicit short-window signals or parsed short Retry-After values, honor Retry-After in the retry sleep, preserve zero-rotation fallback behavior, and record same-model rate-limit retries separately from profile rotations.

Verification:
- node scripts/run-vitest.mjs src/agents/embedded-agent-runner/run/assistant-failover.test.ts src/agents/embedded-agent-runner/run/helpers.test.ts
- Azure Crabbox cbx_bdb5a7807a1f / coral-shrimp: OPENCLAW_CHECK_CHANGED_REMOTE_CHILD=1 OPENCLAW_CHANGED_LANES_RAW_SYNC=1 corepack pnpm check:changed
- .agents/skills/autoreview/scripts/autoreview --mode branch --base origin/main
wangmiao0668000666 pushed a commit to wangmiao0668000666/openclaw that referenced this pull request Jun 12, 2026
…aw#91911)

Bound same-model rate-limit retries to explicit short-window signals or parsed short Retry-After values, honor Retry-After in the retry sleep, preserve zero-rotation fallback behavior, and record same-model rate-limit retries separately from profile rotations.

Verification:
- node scripts/run-vitest.mjs src/agents/embedded-agent-runner/run/assistant-failover.test.ts src/agents/embedded-agent-runner/run/helpers.test.ts
- Azure Crabbox cbx_bdb5a7807a1f / coral-shrimp: OPENCLAW_CHECK_CHANGED_REMOTE_CHILD=1 OPENCLAW_CHANGED_LANES_RAW_SYNC=1 corepack pnpm check:changed
- .agents/skills/autoreview/scripts/autoreview --mode branch --base origin/main
wangmiao0668000666 pushed a commit to wangmiao0668000666/openclaw that referenced this pull request Jun 12, 2026
…aw#91911)

Bound same-model rate-limit retries to explicit short-window signals or parsed short Retry-After values, honor Retry-After in the retry sleep, preserve zero-rotation fallback behavior, and record same-model rate-limit retries separately from profile rotations.

Verification:
- node scripts/run-vitest.mjs src/agents/embedded-agent-runner/run/assistant-failover.test.ts src/agents/embedded-agent-runner/run/helpers.test.ts
- Azure Crabbox cbx_bdb5a7807a1f / coral-shrimp: OPENCLAW_CHECK_CHANGED_REMOTE_CHILD=1 OPENCLAW_CHANGED_LANES_RAW_SYNC=1 corepack pnpm check:changed
- .agents/skills/autoreview/scripts/autoreview --mode branch --base origin/main
github-actions Bot pushed a commit to Desicool/openclaw that referenced this pull request Jun 12, 2026
…aw#91911)

Bound same-model rate-limit retries to explicit short-window signals or parsed short Retry-After values, honor Retry-After in the retry sleep, preserve zero-rotation fallback behavior, and record same-model rate-limit retries separately from profile rotations.

Verification:
- node scripts/run-vitest.mjs src/agents/embedded-agent-runner/run/assistant-failover.test.ts src/agents/embedded-agent-runner/run/helpers.test.ts
- Azure Crabbox cbx_bdb5a7807a1f / coral-shrimp: OPENCLAW_CHECK_CHANGED_REMOTE_CHILD=1 OPENCLAW_CHANGED_LANES_RAW_SYNC=1 corepack pnpm check:changed
- .agents/skills/autoreview/scripts/autoreview --mode branch --base origin/main
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents Agent runtime and tooling merge-risk: 🚨 auth-provider 🚨 May break OAuth, tokens, provider routing, model choice, or credentials. merge-risk: 🚨 availability 🚨 May cause crashes, hangs, restart loops, stalls, or process outages. P2 Normal backlog priority with limited blast radius. rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. size: L status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants