Skip to content

feat(narrate): retry visibility — structured events + live CLI narration (#91)#113

Merged
EffortlessSteven merged 1 commit into
mainfrom
feat/91-retry-visibility
Apr 17, 2026
Merged

feat(narrate): retry visibility — structured events + live CLI narration (#91)#113
EffortlessSteven merged 1 commit into
mainfrom
feat/91-retry-visibility

Conversation

@EffortlessSteven

Copy link
Copy Markdown
Member

Closes #91 PR 1 — the highest-leverage operator-UX fix from ROADMAP's Now block.

Problem

During v0.3.0-rc.1's first real publish, Shipper absorbed 41 retries across a 69-minute run and emitted zero live signal during the 9–12 minute rate-limit windows. Operators had to run an external sparse-index polling monitor to know the train was alive. Three `thread::sleep(delay)` sites were fully silent.

Solution

New `EventType::RetryBackoffStarted` variant carries the full retry context:
`attempt`, `max_attempts`, `delay_ms`, `next_attempt_at: DateTime`, `reason: ErrorClass`, `message: String`.

Two helpers, one per retry loop shape:

  • `engine::parallel::publish::emit_retry_backoff` — handles Arc<Mutex<_>> wrapping (3 call sites)
  • `engine::mod::emit_retry_backoff_event` — raw &mut refs (2 call sites)

Each emits the structured event, warns via Reporter with rich format, then sleeps.

Behaviour delta

Before: silent `thread::sleep(delay)` on every retry.

After: every retry emits one event + one warn line:
```
shipper-sparse-index@0.3.0-rc.1: HTTP 429 Too Many Requests (Retryable); next attempt in 10m 30s (attempt 3/12)
```

Three previously-silent readiness-retry sites now also narrate.

Files

File Change
`crates/shipper-types/src/lib.rs` +`EventType::RetryBackoffStarted` variant
`crates/shipper/src/engine/parallel/publish.rs` +`emit_retry_backoff` helper; 3 sites rewired
`crates/shipper/src/engine/mod.rs` +`emit_retry_backoff_event` helper; 2 sites rewired; 1 assertion updated to match new message

Net: 3 files, +199 / -24.

Test plan

  • `cargo check` + `cargo clippy --all-targets -- -D warnings` clean
  • `cargo fmt --check` clean
  • shipper-types: 268 pass
  • engine + parallel + consistency modules: 199 pass (no regressions)
  • Updated one test assertion that matched the old "retrying in" text to the new "next attempt in" text

Out of scope

Related

…ion (#91)

Replaces every silent `thread::sleep(delay)` at a retry-backoff site with a
structured event + human-readable warn line. Closes the 60-minute silent-
window problem from v0.3.0-rc.1 where operators had to run an external
sparse-index monitor just to know the train was alive.

## Type addition (shipper-types)

New `EventType::RetryBackoffStarted` variant carries the full retry context:

- `attempt` — the just-failed attempt number (1-indexed)
- `max_attempts` — total budget
- `delay_ms` — how long we'll sleep
- `next_attempt_at: DateTime<Utc>` — computed wake time (useful for
  dashboards and ETAs)
- `reason: ErrorClass` — what class of failure triggered the retry
- `message: String` — one-line human-facing description

Additive. No breaking changes.

## Wire-in

Two helpers, one per retry loop shape:

- `engine::parallel::publish::emit_retry_backoff` — takes Arc<Mutex<_>>
  wrappers; used at the 3 retry-backoff sites in the parallel publish loop
  (cargo-failure retry, readiness-not-visible retry, readiness-errored retry)
- `engine::mod::emit_retry_backoff_event` — takes raw &mut references;
  used at the 2 retry-backoff sites in the sequential publish loop

Each helper:
1. Records a `RetryBackoffStarted` event and flushes to events.jsonl
2. Warns via Reporter with the rich format:
   `name@version: <message> (<ErrorClass>); next attempt in <duration> (attempt N/M)`
3. Sleeps for the configured delay

## Behavior delta

**Before:** silent `thread::sleep(delay)` on every retry; operators stared at
a quiet CI log for 9–12 minutes between each rate-limit window.

**After:** every retry emits one event + one warn line carrying reason,
duration, next-attempt time, and progress (N of M). Three readiness-retry
sites that were FULLY silent before now also narrate.

## Test plan

- [x] `cargo check` + `cargo clippy --all-targets -- -D warnings` clean
- [x] `cargo fmt --check` clean
- [x] shipper-types tests: 268 pass
- [x] `engine::parallel` + `engine::tests` + `state::consistency`: 199 pass
- [x] Updated one existing assertion that matched the old "retrying in" text
      to the new "next attempt in" text

## Scope

Closes #91's PR 1. Follow-up PR 2 (CLI-specific rendering polish like
`shipper watch` subcommand, ETA/progress line aggregation) is tracked under
the same issue but out of scope for this PR — the structured event is here
for any consumer to render however they like.

## Related

- Parent issue: #91 (umbrella: #103 Narrate — Partial)
- Master roadmap: #109 (Now block item 3)
@gemini-code-assist

Copy link
Copy Markdown

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@coderabbitai

coderabbitai Bot commented Apr 17, 2026

Copy link
Copy Markdown

Warning

Rate limit exceeded

@EffortlessSteven has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 0 minutes and 9 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 0 minutes and 9 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: ebc2ac60-93db-42a2-a1df-dcb40dfb1621

📥 Commits

Reviewing files that changed from the base of the PR and between 0eeb91f and 0ac2d4c.

📒 Files selected for processing (3)
  • crates/shipper-types/src/lib.rs
  • crates/shipper/src/engine/mod.rs
  • crates/shipper/src/engine/parallel/publish.rs
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/91-retry-visibility

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0ac2d4ce6a

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +1163 to +1164
attempt.saturating_add(1),
max_attempts,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Avoid advertising a retry after the final allowed attempt

emit_retry_backoff_event always reports the next attempt as attempt + 1, but the loop still invokes this helper when attempt == max_attempts; after that sleep, the loop exits and no further publish attempt happens. This yields impossible progress like attempt 4/3 and emits RetryBackoffStarted for a retry that cannot occur, which can mislead operators or any tooling that reads retry telemetry (the parallel helper repeats the same pattern in engine/parallel/publish.rs).

Useful? React with 👍 / 👎.

@codecov

codecov Bot commented Apr 17, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 79.75460% with 33 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
crates/shipper/src/engine/parallel/publish.rs 65.93% 31 Missing ⚠️
crates/shipper/src/engine/mod.rs 97.22% 2 Missing ⚠️

📢 Thoughts on this report? Let us know!

@EffortlessSteven EffortlessSteven merged commit dbfb76c into main Apr 17, 2026
20 of 21 checks passed
@EffortlessSteven EffortlessSteven deleted the feat/91-retry-visibility branch April 17, 2026 03:34
EffortlessSteven added a commit that referenced this pull request Apr 17, 2026
… follow-on)

Closes the remaining safety gap from #111: when a prior run left a package
in PackageState::Ambiguous (reconciliation was inconclusive at that time),
a subsequent `shipper resume` previously fell through to blind cargo retry.
That risked re-uploading a crate whose original upload may have actually
succeeded.

## What this adds

Both the sequential (engine::mod) and parallel (engine::parallel::publish)
publish paths now detect `PackageState::Ambiguous` BEFORE any cargo
activity and reconcile against registry truth first.

Outcomes follow the same state machine as in-flight reconciliation:

- **Published** → mark Published, skip any further cargo work for this
  crate, continue/return success receipt with no republish.
- **NotPublished** → clear Ambiguous back to Pending (registry confirms
  no prior upload), fall through to the normal publish flow.
- **StillUnknown** → emit PublishFailed webhook + halt with an
  operator-actionable error; do NOT retry.

## Parallel + sequential symmetry

- Sequential: new `sequential_reconcile` helper in engine::mod wraps the
  full `RegistryClient::is_version_visible_with_backoff` and returns the
  same `(ReconciliationOutcome, Vec<ReadinessEvidence>)` shape as the
  parallel path's helper. Called from the per-package state-check match
  arm.
- Parallel: reuses existing `reconcile_ambiguous_upload` from
  engine::parallel::reconcile. Called just before the retry-loop entry,
  returning a completed PackageResult early on Published or StillUnknown.

Both paths emit `PublishReconciling` and `PublishReconciled` events so
the event stream records the resume-time reconciliation alongside any
in-flight reconciliation from earlier runs.

## Test plan

- [x] `cargo clippy --all-targets -- -D warnings` clean
- [x] `cargo fmt --check` clean
- [x] 192 engine tests pass (no regressions)

BDD scenarios for resume-path reconciliation across all three outcomes
are a natural follow-up; the handler shape mirrors the in-flight path so
regressions here surface through existing integration tests that already
cover the outcome shape.

## Related

- Parent: #99 Reconcile (the biggest non-cosmetic safety gap from the
  v0.3.0-rc.1 retrospective)
- Sibling PRs already merged: #111 (in-flight reconcile), #112 (events-
  as-truth), #113 (retry narration), #114 (registry-aware backoff)
- Master roadmap: #109
EffortlessSteven added a commit that referenced this pull request Apr 17, 2026
… follow-on) (#115)

Closes the remaining safety gap from #111: when a prior run left a package
in PackageState::Ambiguous (reconciliation was inconclusive at that time),
a subsequent `shipper resume` previously fell through to blind cargo retry.
That risked re-uploading a crate whose original upload may have actually
succeeded.

## What this adds

Both the sequential (engine::mod) and parallel (engine::parallel::publish)
publish paths now detect `PackageState::Ambiguous` BEFORE any cargo
activity and reconcile against registry truth first.

Outcomes follow the same state machine as in-flight reconciliation:

- **Published** → mark Published, skip any further cargo work for this
  crate, continue/return success receipt with no republish.
- **NotPublished** → clear Ambiguous back to Pending (registry confirms
  no prior upload), fall through to the normal publish flow.
- **StillUnknown** → emit PublishFailed webhook + halt with an
  operator-actionable error; do NOT retry.

## Parallel + sequential symmetry

- Sequential: new `sequential_reconcile` helper in engine::mod wraps the
  full `RegistryClient::is_version_visible_with_backoff` and returns the
  same `(ReconciliationOutcome, Vec<ReadinessEvidence>)` shape as the
  parallel path's helper. Called from the per-package state-check match
  arm.
- Parallel: reuses existing `reconcile_ambiguous_upload` from
  engine::parallel::reconcile. Called just before the retry-loop entry,
  returning a completed PackageResult early on Published or StillUnknown.

Both paths emit `PublishReconciling` and `PublishReconciled` events so
the event stream records the resume-time reconciliation alongside any
in-flight reconciliation from earlier runs.

## Test plan

- [x] `cargo clippy --all-targets -- -D warnings` clean
- [x] `cargo fmt --check` clean
- [x] 192 engine tests pass (no regressions)

BDD scenarios for resume-path reconciliation across all three outcomes
are a natural follow-up; the handler shape mirrors the in-flight path so
regressions here surface through existing integration tests that already
cover the outcome shape.

## Related

- Parent: #99 Reconcile (the biggest non-cosmetic safety gap from the
  v0.3.0-rc.1 retrospective)
- Sibling PRs already merged: #111 (in-flight reconcile), #112 (events-
  as-truth), #113 (retry narration), #114 (registry-aware backoff)
- Master roadmap: #109
EffortlessSteven added a commit that referenced this pull request Apr 19, 2026
Implements PR 1 from the scout's plan on issue #103. Wires the existing
RetryBackoffStarted event (landed in PR #113) to a live CLI countdown
in the publish progress reporter, lifting the previously dead
`ProgressReporter::set_status` function and replacing the silent
`thread::sleep(delay)` window with per-second narration in TTY mode.

## What changed

- `engine::Reporter` (and the parallel `Reporter` mirror) gain a new
  `retry_wait(name, version, attempt, max_attempts, delay, reason, msg)`
  default method. The default impl preserves the pre-#103 behavior
  (single `warn()` line + `thread::sleep(delay)`), so every existing
  reporter implementation continues to behave exactly as before.
- The two retry-backoff helpers (`engine::emit_retry_backoff_event` and
  `engine::parallel::publish::emit_retry_backoff`) now record the
  `RetryBackoffStarted` event and then delegate the human-facing
  message + the backoff sleep to `Reporter::retry_wait`. The events.jsonl
  schema is unchanged.
- `ProgressReporter::set_status` is lifted off `#[allow(dead_code)]`.
- New `ProgressReporter::retry_countdown(...)` renders:
  - **TTY:** `[1/N] retrying name@ver in 45s... (attempt 2/5, reason: Retryable) - <msg>`
    refreshed once per second via `set_status` until the wait expires.
  - **Non-TTY (CI):** a single `[retry] name@ver: msg (Retryable); next attempt in 45s (attempt 2/5)`
    line via `eprintln!`, then `thread::sleep(delay)` — no per-second
    spam in pipeline logs.
  - **Quiet:** silent, but still blocks for the full delay.
- `CliReporter` grows an optional `progress: Option<ProgressReporter>`
  field. The `publish` and `resume` subcommands install the existing
  per-registry `ProgressReporter` on the reporter before the engine
  call (and `take_progress().finish()` after), so the engine's
  retry-narration path drives the live countdown.
- `CliReporter::retry_wait` overrides the trait default to route
  through the installed progress, with a graceful fallback to the
  warn-line behavior for callers without a progress bar (e.g. tests).

## Acceptance criteria from the scout's plan

- [x] When `RetryBackoffStarted` is emitted, operator sees:
      - TTY: countdown line ticking down once per second
      - Non-TTY: one-shot `[retry] ...` line then sleep
- [x] Event is still emitted to events.jsonl; no event-schema changes
- [x] `ProgressReporter::set_status` is no longer dead code
- [x] Existing reporter implementations (TestReporter, CollectingReporter,
      SendReporter) continue to compile and behave the same via the
      default trait impl

## Tests

- `progress::tests`: 4 new cases — silent-mode blocks for delay, zero
  delay returns immediately, non-TTY one-shot line drives the !is_tty
  branch, exotic attempt/max boundaries don't panic.
- `lib::tests`: 3 new cases — `CliReporter::retry_wait` without progress
  blocks for delay, with progress routes through countdown, default
  trait impl preserves the canonical warn-line format.
- All 113 `shipper-cli` lib tests pass; all 22 `cli_e2e` integration
  tests pass; all 1890 `shipper-core` tests pass; all 39 `shipper`
  facade tests pass.
EffortlessSteven added a commit that referenced this pull request Apr 21, 2026
* feat(narrate): live CLI countdown during retry/backoff waits (#103 PR 1)

Implements PR 1 from the scout's plan on issue #103. Wires the existing
RetryBackoffStarted event (landed in PR #113) to a live CLI countdown
in the publish progress reporter, lifting the previously dead
`ProgressReporter::set_status` function and replacing the silent
`thread::sleep(delay)` window with per-second narration in TTY mode.

## What changed

- `engine::Reporter` (and the parallel `Reporter` mirror) gain a new
  `retry_wait(name, version, attempt, max_attempts, delay, reason, msg)`
  default method. The default impl preserves the pre-#103 behavior
  (single `warn()` line + `thread::sleep(delay)`), so every existing
  reporter implementation continues to behave exactly as before.
- The two retry-backoff helpers (`engine::emit_retry_backoff_event` and
  `engine::parallel::publish::emit_retry_backoff`) now record the
  `RetryBackoffStarted` event and then delegate the human-facing
  message + the backoff sleep to `Reporter::retry_wait`. The events.jsonl
  schema is unchanged.
- `ProgressReporter::set_status` is lifted off `#[allow(dead_code)]`.
- New `ProgressReporter::retry_countdown(...)` renders:
  - **TTY:** `[1/N] retrying name@ver in 45s... (attempt 2/5, reason: Retryable) - <msg>`
    refreshed once per second via `set_status` until the wait expires.
  - **Non-TTY (CI):** a single `[retry] name@ver: msg (Retryable); next attempt in 45s (attempt 2/5)`
    line via `eprintln!`, then `thread::sleep(delay)` — no per-second
    spam in pipeline logs.
  - **Quiet:** silent, but still blocks for the full delay.
- `CliReporter` grows an optional `progress: Option<ProgressReporter>`
  field. The `publish` and `resume` subcommands install the existing
  per-registry `ProgressReporter` on the reporter before the engine
  call (and `take_progress().finish()` after), so the engine's
  retry-narration path drives the live countdown.
- `CliReporter::retry_wait` overrides the trait default to route
  through the installed progress, with a graceful fallback to the
  warn-line behavior for callers without a progress bar (e.g. tests).

## Acceptance criteria from the scout's plan

- [x] When `RetryBackoffStarted` is emitted, operator sees:
      - TTY: countdown line ticking down once per second
      - Non-TTY: one-shot `[retry] ...` line then sleep
- [x] Event is still emitted to events.jsonl; no event-schema changes
- [x] `ProgressReporter::set_status` is no longer dead code
- [x] Existing reporter implementations (TestReporter, CollectingReporter,
      SendReporter) continue to compile and behave the same via the
      default trait impl

## Tests

- `progress::tests`: 4 new cases — silent-mode blocks for delay, zero
  delay returns immediately, non-TTY one-shot line drives the !is_tty
  branch, exotic attempt/max boundaries don't panic.
- `lib::tests`: 3 new cases — `CliReporter::retry_wait` without progress
  blocks for delay, with progress routes through countdown, default
  trait impl preserves the canonical warn-line format.
- All 113 `shipper-cli` lib tests pass; all 22 `cli_e2e` integration
  tests pass; all 1890 `shipper-core` tests pass; all 39 `shipper`
  facade tests pass.

* fix(parallel): release reporter lock before retry sleep

* fix(narrate): restore live retry feedback in parallel mode
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Surface retry/backoff state to operators (structured events + CLI visibility)

1 participant