Skip to content

feat(dispatch): crates.io-aware backoff for new-crate rate limits (#94)#114

Merged
EffortlessSteven merged 1 commit into
mainfrom
feat/94-registry-aware-backoff
Apr 17, 2026
Merged

feat(dispatch): crates.io-aware backoff for new-crate rate limits (#94)#114
EffortlessSteven merged 1 commit into
mainfrom
feat/94-registry-aware-backoff

Conversation

@EffortlessSteven

Copy link
Copy Markdown
Member

Closes #94 PR 1 — the retry efficiency gap from v0.3.0-rc.1.

Problem

Generic exponential backoff from `opts.base_delay` wastes ~5 attempts when a new-crate publish hits crates.io's documented 10-minute new-crate rate-limit window. The v0.3.0-rc.1 retrospective estimated 35 of 41 retries were preventable for exactly this reason.

Solution

Add a layered `registry_aware_backoff`: if a retry's error message looks like a rate-limit AND the crate is brand-new, floor the delay at the documented 10-minute window. Everything else keeps generic exponential.

New helpers (`runtime/execution`)

  • `CRATES_IO_NEW_CRATE_WINDOW = 10 * 60s` — documented constant from crates.io rate-limit docs
  • `looks_like_rate_limit(msg)` — detects "429", "rate limit", "too many requests" phrasings in cargo stderr
  • `registry_aware_backoff(..., is_new_crate, error_message)` — wraps `backoff_delay`, floors at the window only when both conditions match

Wire-in (both paths, lazy detection)

Both parallel and sequential retry loops replace `backoff_delay` with `registry_aware_backoff`, but:

  1. `crate_exists` / `check_new_crate` is only queried when the error message matches a rate-limit phrasing → zero extra network calls on the happy path or on generic network-blip retries
  2. The answer is cached per-package via `Option` → subsequent retries don't re-query

Test plan

  • `cargo clippy --all-targets -- -D warnings` clean
  • `cargo fmt --check` clean
  • 5 new unit tests:
    • `looks_like_rate_limit` positive + negative phrasings
    • extends for new-crate + rate-limit (floored at 10 min)
    • UNCHANGED for existing-crate + rate-limit (higher per-minute budget)
    • UNCHANGED for new-crate + non-rate-limit (don't wait 10 min for a network blip)
    • respects longer generic delay when it exceeds the window
  • 192 engine tests pass (no regressions)
  • shipper-types (268), state::consistency (9), reconcile (2) all green

An initial eager version of this change added an extra registry request at publish-start that broke 32 existing tests whose mock servers weren't provisioned for it. The lazy-detection refactor fixed that cleanly.

Expected impact

On the next first-publish wave, rate-limited retries of brand-new crates skip the wasted short backoffs and go directly to ~10-minute waits. Projected ~35-retry savings per the v0.3.0-rc.1 pattern. Narration from #91 makes the wait visible, so the operator sees one big wait instead of six small misleading ones.

Out of scope

Related

Before: generic exponential backoff on retry, starting at opts.base_delay
and doubling. For a new-crate publish hitting crates.io's documented
10-minute rate-limit window, this burned ~5 guaranteed-to-fail attempts
(10s, 20s, 40s, 80s, 160s) before the window could plausibly have cleared.
The v0.3.0-rc.1 retrospective estimated 35 of 41 retries were preventable
for exactly this reason.

After: when a retry's error message looks like a rate-limit signal AND the
crate is brand-new, floor the delay at 10 minutes. Everything else keeps
the generic exponential behavior.

## New helpers (runtime/execution)

- `CRATES_IO_NEW_CRATE_WINDOW = 10 * 60s` — documented crates.io constant
- `looks_like_rate_limit(msg)` — detects "429" / "rate limit" / "too many
  requests" phrasings in cargo stderr
- `registry_aware_backoff(base, max, attempt, strategy, jitter,
  is_new_crate, error_message)` — wraps `backoff_delay`, floors at
  CRATES_IO_NEW_CRATE_WINDOW only when both `is_new_crate` and
  `looks_like_rate_limit(msg)` are true

## Wire-in (lazy, no happy-path regressions)

Both the parallel and sequential retry loops now:

1. Use `registry_aware_backoff` instead of `backoff_delay` on the
   Retryable/Ambiguous branch
2. Only query `crate_exists` / `check_new_crate` when the error message
   matches a rate-limit phrasing — zero extra network calls on the happy
   path or on generic network-blip retries
3. Cache the is_new_crate answer per-package via `Option<bool>`, so
   subsequent retries of the same crate don't re-query

## Test plan

- [x] `cargo check` + `cargo clippy --all-targets -- -D warnings` clean
- [x] `cargo fmt --check` clean
- [x] 5 new unit tests in runtime::execution covering:
  - `looks_like_rate_limit` positive and negative phrasings
  - `registry_aware_backoff` extends for new-crate + rate-limit
  - `registry_aware_backoff` UNCHANGED for existing-crate + rate-limit
  - `registry_aware_backoff` UNCHANGED for new-crate + non-rate-limit
  - `registry_aware_backoff` respects longer generic delay when it exceeds window
- [x] 192 engine tests pass (no regressions after lazy-detection refactor;
      initial eager version added an extra registry request that test mocks
      weren't provisioned for — lazy path fixed that)
- [x] shipper-types (268), state::consistency (9), reconcile (2) all green

## Expected real-world impact

On the next first-publish wave (hypothetical v0.3.0-rc.2 or v0.3.1
first-publish), rate-limited retries of brand-new crates skip the wasted
short backoffs and wait ~10 minutes directly. Projected ~35-retry savings
per the v0.3.0-rc.1 pattern.

## Out of scope

- `Retry-After` HTTP header parsing (requires plumbing cargo stderr into
  a structured response) — follow-up #94 PR 2
- Full `RegistryProfile` abstraction — follow-up #94 PR 3
- Propagating a `PublishRegime` enum through PlannedPackage — not strictly
  needed for this win; lazy detection at retry time is sufficient

## Related

- Parent issue: #94 (Dispatch)
- Competency umbrella: #106 (Profile/Adapt — was Partial, now Partial+)
- Master roadmap: #109 (Now block item 4)
@gemini-code-assist

Copy link
Copy Markdown

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@coderabbitai

coderabbitai Bot commented Apr 17, 2026

Copy link
Copy Markdown

Warning

Rate limit exceeded

@EffortlessSteven has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 19 minutes and 35 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 19 minutes and 35 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 8e34bded-b1d3-4fc4-99f2-f7630dedce15

📥 Commits

Reviewing files that changed from the base of the PR and between dbfb76c and 8bd2a23.

📒 Files selected for processing (3)
  • crates/shipper/src/engine/mod.rs
  • crates/shipper/src/engine/parallel/publish.rs
  • crates/shipper/src/runtime/execution/mod.rs
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/94-registry-aware-backoff

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8bd2a23545

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

ErrorClass::Retryable | ErrorClass::Ambiguous => {
let delay = backoff_delay(
let is_new_crate =
if crate::runtime::execution::looks_like_rate_limit(&msg) {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Detect rate limits from cargo stderr, not classifier summary

looks_like_rate_limit(&msg) is checking the normalized classifier message (msg from classify_cargo_failure) rather than the raw cargo output, so this condition never sees strings like 429/too many requests and the crates.io-aware 10-minute floor is effectively disabled. In practice, retryable failures currently produce generic summaries (e.g. transient failure text), so is_new_crate stays false and this optimization never activates; the same pattern is also present in the parallel publish path.

Useful? React with 👍 / 👎.

@codecov

codecov Bot commented Apr 17, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 94.56522% with 5 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
crates/shipper/src/engine/mod.rs 70.00% 3 Missing ⚠️
crates/shipper/src/engine/parallel/publish.rs 75.00% 2 Missing ⚠️

📢 Thoughts on this report? Let us know!

@EffortlessSteven EffortlessSteven merged commit 4ee237f into main Apr 17, 2026
20 of 21 checks passed
@EffortlessSteven EffortlessSteven deleted the feat/94-registry-aware-backoff branch April 17, 2026 05:31
EffortlessSteven added a commit that referenced this pull request Apr 17, 2026
… follow-on)

Closes the remaining safety gap from #111: when a prior run left a package
in PackageState::Ambiguous (reconciliation was inconclusive at that time),
a subsequent `shipper resume` previously fell through to blind cargo retry.
That risked re-uploading a crate whose original upload may have actually
succeeded.

## What this adds

Both the sequential (engine::mod) and parallel (engine::parallel::publish)
publish paths now detect `PackageState::Ambiguous` BEFORE any cargo
activity and reconcile against registry truth first.

Outcomes follow the same state machine as in-flight reconciliation:

- **Published** → mark Published, skip any further cargo work for this
  crate, continue/return success receipt with no republish.
- **NotPublished** → clear Ambiguous back to Pending (registry confirms
  no prior upload), fall through to the normal publish flow.
- **StillUnknown** → emit PublishFailed webhook + halt with an
  operator-actionable error; do NOT retry.

## Parallel + sequential symmetry

- Sequential: new `sequential_reconcile` helper in engine::mod wraps the
  full `RegistryClient::is_version_visible_with_backoff` and returns the
  same `(ReconciliationOutcome, Vec<ReadinessEvidence>)` shape as the
  parallel path's helper. Called from the per-package state-check match
  arm.
- Parallel: reuses existing `reconcile_ambiguous_upload` from
  engine::parallel::reconcile. Called just before the retry-loop entry,
  returning a completed PackageResult early on Published or StillUnknown.

Both paths emit `PublishReconciling` and `PublishReconciled` events so
the event stream records the resume-time reconciliation alongside any
in-flight reconciliation from earlier runs.

## Test plan

- [x] `cargo clippy --all-targets -- -D warnings` clean
- [x] `cargo fmt --check` clean
- [x] 192 engine tests pass (no regressions)

BDD scenarios for resume-path reconciliation across all three outcomes
are a natural follow-up; the handler shape mirrors the in-flight path so
regressions here surface through existing integration tests that already
cover the outcome shape.

## Related

- Parent: #99 Reconcile (the biggest non-cosmetic safety gap from the
  v0.3.0-rc.1 retrospective)
- Sibling PRs already merged: #111 (in-flight reconcile), #112 (events-
  as-truth), #113 (retry narration), #114 (registry-aware backoff)
- Master roadmap: #109
EffortlessSteven added a commit that referenced this pull request Apr 17, 2026
… follow-on) (#115)

Closes the remaining safety gap from #111: when a prior run left a package
in PackageState::Ambiguous (reconciliation was inconclusive at that time),
a subsequent `shipper resume` previously fell through to blind cargo retry.
That risked re-uploading a crate whose original upload may have actually
succeeded.

## What this adds

Both the sequential (engine::mod) and parallel (engine::parallel::publish)
publish paths now detect `PackageState::Ambiguous` BEFORE any cargo
activity and reconcile against registry truth first.

Outcomes follow the same state machine as in-flight reconciliation:

- **Published** → mark Published, skip any further cargo work for this
  crate, continue/return success receipt with no republish.
- **NotPublished** → clear Ambiguous back to Pending (registry confirms
  no prior upload), fall through to the normal publish flow.
- **StillUnknown** → emit PublishFailed webhook + halt with an
  operator-actionable error; do NOT retry.

## Parallel + sequential symmetry

- Sequential: new `sequential_reconcile` helper in engine::mod wraps the
  full `RegistryClient::is_version_visible_with_backoff` and returns the
  same `(ReconciliationOutcome, Vec<ReadinessEvidence>)` shape as the
  parallel path's helper. Called from the per-package state-check match
  arm.
- Parallel: reuses existing `reconcile_ambiguous_upload` from
  engine::parallel::reconcile. Called just before the retry-loop entry,
  returning a completed PackageResult early on Published or StillUnknown.

Both paths emit `PublishReconciling` and `PublishReconciled` events so
the event stream records the resume-time reconciliation alongside any
in-flight reconciliation from earlier runs.

## Test plan

- [x] `cargo clippy --all-targets -- -D warnings` clean
- [x] `cargo fmt --check` clean
- [x] 192 engine tests pass (no regressions)

BDD scenarios for resume-path reconciliation across all three outcomes
are a natural follow-up; the handler shape mirrors the in-flight path so
regressions here surface through existing integration tests that already
cover the outcome shape.

## Related

- Parent: #99 Reconcile (the biggest non-cosmetic safety gap from the
  v0.3.0-rc.1 retrospective)
- Sibling PRs already merged: #111 (in-flight reconcile), #112 (events-
  as-truth), #113 (retry narration), #114 (registry-aware backoff)
- Master roadmap: #109
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Registry-aware backoff: layer documented constraints > Retry-After > exponential

1 participant