fix(agents): retry same model across short rate-limit windows by lanzhi-lee · Pull Request #91911 · openclaw/openclaw

lanzhi-lee · 2026-06-10T09:30:42Z

Summary

Retry assistant-stage short-window rate limits on the same provider/model/profile before spending profile rotation or model fallback budget.
Use a bounded deterministic retry budget of 10s, 20s, and 30s so common minute-scale RPM/TPM windows can clear without adding unbounded latency.
Keep quota, usage-limit, subscription, daily/monthly, and bare quota-exceeded 429 failures on the existing escalation path; allow quota wording only when it carries explicit RPM/TPM/per-minute hints.

Verification

node scripts/run-vitest.mjs src/agents/embedded-agent-runner/run/helpers.test.ts src/agents/embedded-agent-runner/run/assistant-failover.test.ts
node scripts/run-oxlint.mjs src/agents/embedded-agent-runner/run.ts src/agents/embedded-agent-runner/run/assistant-failover.ts src/agents/embedded-agent-runner/run/assistant-failover.test.ts src/agents/embedded-agent-runner/run/helpers.ts src/agents/embedded-agent-runner/run/helpers.test.ts
git diff --check upstream/main..HEAD
.agents/skills/autoreview/scripts/autoreview --mode branch --base upstream/main — clean before the final conflict-free rebase; no review rerun after the final upstream rebase per maintainer instruction.

clawsweeper · 2026-06-10T09:33:34Z

Codex review: needs real behavior proof before merge. Reviewed June 10, 2026, 10:30 AM ET / 14:30 UTC.

Summary
The PR adds three bounded same-provider/model retries with 10-, 20-, and 30-second delays for assistant-stage rate limits classified as short-window throttles, before existing auth-profile rotation or model fallback.

PR surface: Source +106, Tests +229. Total +335 across 5 files.

Reproducibility: yes. from source: pass a rate-limit assistant error containing Retry-After: 3600 through handleAssistantFailover; the branch classifies it as short-window and invokes the same-model retry path. Current main instead proceeds directly to profile/fallback handling.

Review metrics: none identified.

Merge readiness
Overall: 🧂 unranked krab
Proof: 🧂 unranked krab
Patch quality: 🦪 silver shellfish
Result: blocked until real behavior proof from a real setup is added.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Rank-up moves:

[P1] Fix long or ambiguous Retry-After classification and add its regression coverage.
Emit a distinct trace result for same-model rate-limit retries.
[P1] Add redacted real-provider proof for transient recovery and hard-quota escalation.

Proof guidance:

[P1] Needs real behavior proof before merge: Only focused tests, lint, diff checking, and autoreview are reported; before merge, add redacted terminal logs or live output showing a real short-window throttle recover on the same model and a hard or long-window quota response escalate without the added delay. Redact keys, account details, IP addresses, endpoints, and other private data. Updating the PR body should trigger a fresh ClawSweeper review; otherwise ask a maintainer to comment @clawsweeper re-review. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.

Risk before merge

[P2] Messages containing a long Retry-After interval can be misclassified as minute-scale throttles, delaying an otherwise usable profile or fallback model by the full 60-second retry budget.
[P1] The attempt trace claims a profile rotation for each same-model retry, which can corrupt operator diagnostics and any logic that interprets failover observations.
[P1] The policy changes provider and auth-profile routing across all agent runs without real-provider evidence showing both transient recovery and immediate escalation for hard quota failures.

Maintainer options:

Tighten retry classification and tracing (recommended)
Restrict retries to confirmed short windows, represent the outcome distinctly in traces, add focused coverage for long Retry-After values, and provide redacted live-provider proof.
Accept heuristic delays intentionally
Maintainers may accept message-based classification and up to 60 seconds of delayed fallback after reviewing real-provider evidence and the diagnostic trace implications.
Preserve immediate failover
Pause or close this PR if a cross-provider same-model delay should not become the default rate-limit policy.

Next step before merge

[P1] The contributor should repair the two concrete defects and provide real-provider proof; automation cannot supply evidence from the contributor’s provider setup.

Security
Cleared: The patch changes only agent failover logic and focused tests; it does not alter dependencies, secrets, permissions, downloads, build hooks, publishing, or other supply-chain surfaces.

Review findings

[P2] Do not treat every Retry-After value as a short window — src/agents/embedded-agent-runner/run/assistant-failover.ts:40
[P2] Record same-model retries separately from profile rotations — src/agents/embedded-agent-runner/run.ts:2922-2926

Review details

Best possible solution:

Retry the same model only when structured provider/runtime data or narrowly validated message text establishes a short throttle window, honor an available retry interval within a maintainer-approved bound, emit a distinct same-model-rate-limit observation, and retain immediate profile/model escalation for long-window or ambiguous quota failures.

Do we have a high-confidence way to reproduce the issue?

Yes, from source: pass a rate-limit assistant error containing Retry-After: 3600 through handleAssistantFailover; the branch classifies it as short-window and invokes the same-model retry path. Current main instead proceeds directly to profile/fallback handling.

Is this the best way to solve the issue?

No, not in its current form. Same-model retry is a reasonable repair direction, but generic retry-after matching and inaccurate failover tracing make this broader and less observable than the narrow maintainable solution.

Full review comments:

[P2] Do not treat every Retry-After value as a short window — src/agents/embedded-agent-runner/run/assistant-failover.ts:40
retry-after is accepted as a short-window hint without parsing its duration. A provider error such as 429 quota exceeded; Retry-After: 3600 therefore waits and retries the same unavailable model three times before profile/model fallback. Parse and bound the interval, or require an explicit RPM/TPM/per-minute signal when no structured short interval is available.
Confidence: 0.97
[P2] Record same-model retries separately from profile rotations — src/agents/embedded-agent-runner/run.ts:2922-2926
The new retry kind falls through to result: "rotate_profile", although the path intentionally retains the same profile. This makes attempt traces report rotations that never occurred and obscures whether the new retry policy actually ran. Add a distinct trace result or otherwise preserve the real action.
Confidence: 0.98

Overall correctness: patch is incorrect
Overall confidence: 0.97

AGENTS.md: found and applied where relevant.

Codex review notes: reasoning high; reviewed against 4ecec2f9e2f8.

Label changes

Label changes:

add rating: 🧂 unranked krab: Overall readiness is 🧂 unranked krab; proof is 🧂 unranked krab and patch quality is 🦪 silver shellfish.
add status: 📣 needs proof: The PR needs real behavior proof before ClawSweeper can clear the contributor ask. Needs real behavior proof before merge: Only focused tests, lint, diff checking, and autoreview are reported; before merge, add redacted terminal logs or live output showing a real short-window throttle recover on the same model and a hard or long-window quota response escalate without the added delay. Redact keys, account details, IP addresses, endpoints, and other private data. Updating the PR body should trigger a fresh ClawSweeper review; otherwise ask a maintainer to comment @clawsweeper re-review. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.
remove rating: 🌊 off-meta tidepool: Current PR rating is rating: 🧂 unranked krab, so this older rating label is no longer current.

Label justifications:

P2: This is a normal-priority agent reliability fix with bounded but cross-provider behavior and two concrete repairable defects.
merge-risk: 🚨 auth-provider: Merging changes when rate-limit failures consume the current profile versus rotating credentials or selecting another model/provider.
merge-risk: 🚨 availability: A misclassified rate-limit response can keep a run on an unavailable model for up to 60 seconds before the existing fallback path executes.
rating: 🧂 unranked krab: Overall readiness is 🧂 unranked krab; proof is 🧂 unranked krab and patch quality is 🦪 silver shellfish.
status: 📣 needs proof: The PR needs real behavior proof before ClawSweeper can clear the contributor ask. Needs real behavior proof before merge: Only focused tests, lint, diff checking, and autoreview are reported; before merge, add redacted terminal logs or live output showing a real short-window throttle recover on the same model and a hard or long-window quota response escalate without the added delay. Redact keys, account details, IP addresses, endpoints, and other private data. Updating the PR body should trigger a fresh ClawSweeper review; otherwise ask a maintainer to comment @clawsweeper re-review. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.

Evidence reviewed

PR surface:

Source +106, Tests +229. Total +335 across 5 files.

View PR surface stats

Area	Files	Added	Removed	Net
Source	3	107	1	+106
Tests	2	229	0	+229
Docs	0	0	0	0
Config	0	0	0	0
Generated	0	0	0	0
Other	0	0	0	0
Total	5	336	1	+335

What I checked:

Overbroad short-window classification: The classifier treats any retry-after wording as a short-window hint, so errors instructing the client to retry after an hour or another long interval still incur three same-model retries before normal fallback. (src/agents/embedded-agent-runner/run/assistant-failover.ts:40, 0841b694cac7)
Incorrect failover observation: The trace expression maps the new same_model_rate_limit outcome to rotate_profile, even though this path deliberately keeps the same provider, model, and profile. (src/agents/embedded-agent-runner/run.ts:2922, 0841b694cac7)
Bounded retry state: The branch caps same-model retries at three, sleeps through the run abort signal, preserves the count only for consecutive rate-limit retries, and resets it on other outcomes. (src/agents/embedded-agent-runner/run.ts:1379, 0841b694cac7)
Existing main behavior: Current main immediately proceeds through rate-limit profile escalation and auth-profile rotation; it does not wait and retry the same model. (src/agents/embedded-agent-runner/run/assistant-failover.ts:156, 4ecec2f9e2f8)
Verification and proof: The PR body reports focused Vitest, oxlint, diff-check, and autoreview runs, but contains no after-fix output from an actual provider or equivalent real environment. (0841b694cac7)
Recent module provenance: The current assistant-failover module was introduced or rewritten by the recent trusted-policy runtime refactor, making that commit the strongest available current-main ownership signal in this shallow checkout. (src/agents/embedded-agent-runner/run/assistant-failover.ts:38, de4b8d8ebf73)

Likely related people:

brokemac79: Commit de4b8d8 introduced or rewrote the current assistant-failover module and its ownership boundary immediately before this PR. (role: recent area contributor; confidence: high; commits: de4b8d8ebf73; files: src/agents/embedded-agent-runner/run/assistant-failover.ts, src/agents/embedded-agent-runner/run.ts)
Vincent Koc: The latest shipped baseline commit is the other available provenance point for the failover path in this shallow checkout, and the PR timeline also explicitly mentions vincentkoc for review context. (role: adjacent runtime and release contributor; confidence: medium; commits: 5181e4f7c82b; files: src/agents/embedded-agent-runner/run.ts, src/agents/embedded-agent-runner/run/assistant-failover.ts)

What the crustacean ranks mean

🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works

ClawSweeper keeps one durable marker-backed review comment per issue or PR.
Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
Maintainers can also comment @clawsweeper review to request a fresh review only.
Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

Port the rate-limit same-model retry subset from internal commit b954516f29a. Provider RPM caps are commonly minute-scale, so wait out the current provider/model/profile window before spending a profile rotation or model fallback. The retry budget is bounded and the existing profile/fallback path runs once it is exhausted. (cherry picked from commit b954516f29ac39ac178c7bf0e0818443e8f183ca)

Port the same-model rate-limit backoff tuning from internal commit 8590a90bde0. Use three deterministic linear retries (10s, 20s, 30s) so the bounded retry budget spans roughly one RPM window while giving early recovery a faster first attempt. (cherry picked from commit 8590a90bde0586ceb50dfa4a3b19f5055329e98d)

lanzhi-lee · 2026-06-10T11:33:55Z

@clawsweeper re-review

clawsweeper · 2026-06-10T11:33:58Z

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

State: Complete
Detail: The targeted re-review finished, the durable review comment was updated, and the synced verdict was routed.
Run: https://github.com/openclaw/clawsweeper/actions/runs/27273376059
Updated: 2026-06-10T11:40:04.226Z

vincentkoc · 2026-06-11T12:00:42Z

Maintainer fix pushed to this PR branch.

Summary:

Bounded same-model rate-limit retry classification to explicit short-window evidence: RPM/TPM/per-minute/frequency signals or parsed short Retry-After values.
Honored parsed Retry-After delays in the actual same-model sleep, capped to the existing short-window retry ceiling.
Preserved rateLimitedProfileRotations: 0 semantics and generic rate limit exceeded profile-rotation/fallback behavior.
Recorded same-model rate-limit retries as same_model_rate_limit and kept fallbackUsed false for pure same-model retries.

Verification:

node scripts/run-vitest.mjs src/agents/embedded-agent-runner/run/assistant-failover.test.ts src/agents/embedded-agent-runner/run/helpers.test.ts
Azure Crabbox cbx_bdb5a7807a1f / coral-shrimp: OPENCLAW_CHECK_CHANGED_REMOTE_CHILD=1 OPENCLAW_CHANGED_LANES_RAW_SYNC=1 corepack pnpm check:changed passed.
Autoreview: clean, no accepted/actionable findings.

GitHub reports this PR is mergeable and has no status checks attached to the current head.

…aw#91911) Bound same-model rate-limit retries to explicit short-window signals or parsed short Retry-After values, honor Retry-After in the retry sleep, preserve zero-rotation fallback behavior, and record same-model rate-limit retries separately from profile rotations. Verification: - node scripts/run-vitest.mjs src/agents/embedded-agent-runner/run/assistant-failover.test.ts src/agents/embedded-agent-runner/run/helpers.test.ts - Azure Crabbox cbx_bdb5a7807a1f / coral-shrimp: OPENCLAW_CHECK_CHANGED_REMOTE_CHILD=1 OPENCLAW_CHANGED_LANES_RAW_SYNC=1 corepack pnpm check:changed - .agents/skills/autoreview/scripts/autoreview --mode branch --base origin/main

openclaw-barnacle Bot added agents Agent runtime and tooling size: M triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. labels Jun 10, 2026

changquanyou and others added 4 commits June 10, 2026 19:17

fix(agents): restrict retries to short rate windows

5d9bc1a

fix(agents): reset rate-limit retry budget after recovery

0841b69

lanzhi-lee force-pushed the codex/rate-limit-same-model-retry branch from 1908d7f to 0841b69 Compare June 10, 2026 11:23

BradGroux mentioned this pull request Jun 10, 2026

WORKING: All Microsoft Issues and PRs (refresh) #74163

Draft

vincentkoc self-assigned this Jun 11, 2026

fix(agents): bound same-model rate-limit retries

879e2bd

openclaw-barnacle Bot added size: L and removed size: M labels Jun 11, 2026

vincentkoc merged commit 9a6c71a into openclaw:main Jun 11, 2026
156 of 167 checks passed

github-actions Bot mentioned this pull request Jun 11, 2026

📡 Upstream Digest — 2026-06-11 13:52 UTC curtismercier/openclaw-mods#1057

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(agents): retry same model across short rate-limit windows#91911

fix(agents): retry same model across short rate-limit windows#91911
vincentkoc merged 5 commits into
openclaw:mainfrom
lanzhi-lee:codex/rate-limit-same-model-retry

lanzhi-lee commented Jun 10, 2026

Uh oh!

clawsweeper Bot commented Jun 10, 2026 •

edited

Loading

Uh oh!

lanzhi-lee commented Jun 10, 2026

Uh oh!

clawsweeper Bot commented Jun 10, 2026 •

edited

Loading

Uh oh!

vincentkoc commented Jun 11, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

lanzhi-lee commented Jun 10, 2026

Summary

Verification

Uh oh!

clawsweeper Bot commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

lanzhi-lee commented Jun 10, 2026

Uh oh!

clawsweeper Bot commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

vincentkoc commented Jun 11, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

clawsweeper Bot commented Jun 10, 2026 •

edited

Loading

clawsweeper Bot commented Jun 10, 2026 •

edited

Loading