feat(dispatch): crates.io-aware backoff for new-crate rate limits (#94)#114
Conversation
Before: generic exponential backoff on retry, starting at opts.base_delay
and doubling. For a new-crate publish hitting crates.io's documented
10-minute rate-limit window, this burned ~5 guaranteed-to-fail attempts
(10s, 20s, 40s, 80s, 160s) before the window could plausibly have cleared.
The v0.3.0-rc.1 retrospective estimated 35 of 41 retries were preventable
for exactly this reason.
After: when a retry's error message looks like a rate-limit signal AND the
crate is brand-new, floor the delay at 10 minutes. Everything else keeps
the generic exponential behavior.
## New helpers (runtime/execution)
- `CRATES_IO_NEW_CRATE_WINDOW = 10 * 60s` — documented crates.io constant
- `looks_like_rate_limit(msg)` — detects "429" / "rate limit" / "too many
requests" phrasings in cargo stderr
- `registry_aware_backoff(base, max, attempt, strategy, jitter,
is_new_crate, error_message)` — wraps `backoff_delay`, floors at
CRATES_IO_NEW_CRATE_WINDOW only when both `is_new_crate` and
`looks_like_rate_limit(msg)` are true
## Wire-in (lazy, no happy-path regressions)
Both the parallel and sequential retry loops now:
1. Use `registry_aware_backoff` instead of `backoff_delay` on the
Retryable/Ambiguous branch
2. Only query `crate_exists` / `check_new_crate` when the error message
matches a rate-limit phrasing — zero extra network calls on the happy
path or on generic network-blip retries
3. Cache the is_new_crate answer per-package via `Option<bool>`, so
subsequent retries of the same crate don't re-query
## Test plan
- [x] `cargo check` + `cargo clippy --all-targets -- -D warnings` clean
- [x] `cargo fmt --check` clean
- [x] 5 new unit tests in runtime::execution covering:
- `looks_like_rate_limit` positive and negative phrasings
- `registry_aware_backoff` extends for new-crate + rate-limit
- `registry_aware_backoff` UNCHANGED for existing-crate + rate-limit
- `registry_aware_backoff` UNCHANGED for new-crate + non-rate-limit
- `registry_aware_backoff` respects longer generic delay when it exceeds window
- [x] 192 engine tests pass (no regressions after lazy-detection refactor;
initial eager version added an extra registry request that test mocks
weren't provisioned for — lazy path fixed that)
- [x] shipper-types (268), state::consistency (9), reconcile (2) all green
## Expected real-world impact
On the next first-publish wave (hypothetical v0.3.0-rc.2 or v0.3.1
first-publish), rate-limited retries of brand-new crates skip the wasted
short backoffs and wait ~10 minutes directly. Projected ~35-retry savings
per the v0.3.0-rc.1 pattern.
## Out of scope
- `Retry-After` HTTP header parsing (requires plumbing cargo stderr into
a structured response) — follow-up #94 PR 2
- Full `RegistryProfile` abstraction — follow-up #94 PR 3
- Propagating a `PublishRegime` enum through PlannedPackage — not strictly
needed for this win; lazy detection at retry time is sufficient
## Related
- Parent issue: #94 (Dispatch)
- Competency umbrella: #106 (Profile/Adapt — was Partial, now Partial+)
- Master roadmap: #109 (Now block item 4)
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
|
Warning Rate limit exceeded
Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 19 minutes and 35 seconds. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Organization UI Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (3)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8bd2a23545
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| ErrorClass::Retryable | ErrorClass::Ambiguous => { | ||
| let delay = backoff_delay( | ||
| let is_new_crate = | ||
| if crate::runtime::execution::looks_like_rate_limit(&msg) { |
There was a problem hiding this comment.
Detect rate limits from cargo stderr, not classifier summary
looks_like_rate_limit(&msg) is checking the normalized classifier message (msg from classify_cargo_failure) rather than the raw cargo output, so this condition never sees strings like 429/too many requests and the crates.io-aware 10-minute floor is effectively disabled. In practice, retryable failures currently produce generic summaries (e.g. transient failure text), so is_new_crate stays false and this optimization never activates; the same pattern is also present in the parallel publish path.
Useful? React with 👍 / 👎.
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
… follow-on) Closes the remaining safety gap from #111: when a prior run left a package in PackageState::Ambiguous (reconciliation was inconclusive at that time), a subsequent `shipper resume` previously fell through to blind cargo retry. That risked re-uploading a crate whose original upload may have actually succeeded. ## What this adds Both the sequential (engine::mod) and parallel (engine::parallel::publish) publish paths now detect `PackageState::Ambiguous` BEFORE any cargo activity and reconcile against registry truth first. Outcomes follow the same state machine as in-flight reconciliation: - **Published** → mark Published, skip any further cargo work for this crate, continue/return success receipt with no republish. - **NotPublished** → clear Ambiguous back to Pending (registry confirms no prior upload), fall through to the normal publish flow. - **StillUnknown** → emit PublishFailed webhook + halt with an operator-actionable error; do NOT retry. ## Parallel + sequential symmetry - Sequential: new `sequential_reconcile` helper in engine::mod wraps the full `RegistryClient::is_version_visible_with_backoff` and returns the same `(ReconciliationOutcome, Vec<ReadinessEvidence>)` shape as the parallel path's helper. Called from the per-package state-check match arm. - Parallel: reuses existing `reconcile_ambiguous_upload` from engine::parallel::reconcile. Called just before the retry-loop entry, returning a completed PackageResult early on Published or StillUnknown. Both paths emit `PublishReconciling` and `PublishReconciled` events so the event stream records the resume-time reconciliation alongside any in-flight reconciliation from earlier runs. ## Test plan - [x] `cargo clippy --all-targets -- -D warnings` clean - [x] `cargo fmt --check` clean - [x] 192 engine tests pass (no regressions) BDD scenarios for resume-path reconciliation across all three outcomes are a natural follow-up; the handler shape mirrors the in-flight path so regressions here surface through existing integration tests that already cover the outcome shape. ## Related - Parent: #99 Reconcile (the biggest non-cosmetic safety gap from the v0.3.0-rc.1 retrospective) - Sibling PRs already merged: #111 (in-flight reconcile), #112 (events- as-truth), #113 (retry narration), #114 (registry-aware backoff) - Master roadmap: #109
… follow-on) (#115) Closes the remaining safety gap from #111: when a prior run left a package in PackageState::Ambiguous (reconciliation was inconclusive at that time), a subsequent `shipper resume` previously fell through to blind cargo retry. That risked re-uploading a crate whose original upload may have actually succeeded. ## What this adds Both the sequential (engine::mod) and parallel (engine::parallel::publish) publish paths now detect `PackageState::Ambiguous` BEFORE any cargo activity and reconcile against registry truth first. Outcomes follow the same state machine as in-flight reconciliation: - **Published** → mark Published, skip any further cargo work for this crate, continue/return success receipt with no republish. - **NotPublished** → clear Ambiguous back to Pending (registry confirms no prior upload), fall through to the normal publish flow. - **StillUnknown** → emit PublishFailed webhook + halt with an operator-actionable error; do NOT retry. ## Parallel + sequential symmetry - Sequential: new `sequential_reconcile` helper in engine::mod wraps the full `RegistryClient::is_version_visible_with_backoff` and returns the same `(ReconciliationOutcome, Vec<ReadinessEvidence>)` shape as the parallel path's helper. Called from the per-package state-check match arm. - Parallel: reuses existing `reconcile_ambiguous_upload` from engine::parallel::reconcile. Called just before the retry-loop entry, returning a completed PackageResult early on Published or StillUnknown. Both paths emit `PublishReconciling` and `PublishReconciled` events so the event stream records the resume-time reconciliation alongside any in-flight reconciliation from earlier runs. ## Test plan - [x] `cargo clippy --all-targets -- -D warnings` clean - [x] `cargo fmt --check` clean - [x] 192 engine tests pass (no regressions) BDD scenarios for resume-path reconciliation across all three outcomes are a natural follow-up; the handler shape mirrors the in-flight path so regressions here surface through existing integration tests that already cover the outcome shape. ## Related - Parent: #99 Reconcile (the biggest non-cosmetic safety gap from the v0.3.0-rc.1 retrospective) - Sibling PRs already merged: #111 (in-flight reconcile), #112 (events- as-truth), #113 (retry narration), #114 (registry-aware backoff) - Master roadmap: #109
Closes #94 PR 1 — the retry efficiency gap from v0.3.0-rc.1.
Problem
Generic exponential backoff from `opts.base_delay` wastes ~5 attempts when a new-crate publish hits crates.io's documented 10-minute new-crate rate-limit window. The v0.3.0-rc.1 retrospective estimated 35 of 41 retries were preventable for exactly this reason.
Solution
Add a layered `registry_aware_backoff`: if a retry's error message looks like a rate-limit AND the crate is brand-new, floor the delay at the documented 10-minute window. Everything else keeps generic exponential.
New helpers (`runtime/execution`)
Wire-in (both paths, lazy detection)
Both parallel and sequential retry loops replace `backoff_delay` with `registry_aware_backoff`, but:
Test plan
An initial eager version of this change added an extra registry request at publish-start that broke 32 existing tests whose mock servers weren't provisioned for it. The lazy-detection refactor fixed that cleanly.
Expected impact
On the next first-publish wave, rate-limited retries of brand-new crates skip the wasted short backoffs and go directly to ~10-minute waits. Projected ~35-retry savings per the v0.3.0-rc.1 pattern. Narration from #91 makes the wait visible, so the operator sees one big wait instead of six small misleading ones.
Out of scope
Related