docs: operator-trust pack - not_proven explainer + stalled-run triage + state-files cheat sheet #118
Closes the "scary-but-correct UX" papercuts flagged across multiple external retrospectives. Three new docs cover operator-legibility gaps where Shipper's behavior is correct but the operator's mental model wasn't supported.

## New docs (Diátaxis)

- **`docs/explanation/finishability.md`**: why `finishability = not_proven` is the correct answer on first publish, not a sign of danger. Maps each case (first publish, token lacks ownership, network flake) to a concrete operator action. Notes the future rehearsal-registry path (#97) that will promote more NotProven cases to Proven.
- **`docs/how-to/inspect-a-stalled-run.md`**: live triage. "Is the train alive or hung?" 30-second check using a tail of `events.jsonl`. Maps common questions (current crate, how long waiting, what will resume do, why did it fail) to the authoritative file + jq recipe. Distinguished from the existing `inspect-state-and-receipts.md`, which covers post-hoc "what happened."
- **`docs/reference/state-files.md`**: one-page cheat sheet. Authority order (events > state > receipt), per-file purpose table, key field paths (including the `.packages[].state.state` nesting caveat, a common misread), jq one-liners for the most frequent queries, sidecar files reference.

## Navigation updates

- `docs/README.md` (Diátaxis index): add all three new entries under their correct quadrants
- Root `README.md`: extend the quick-links strip to include stalled-run triage, state-files cheat sheet, and the not_proven explainer

## Scope

Pure docs. No code, no schema, no snapshot churn. ~450 lines of new content in 3 files + 2 index updates.

## Why these three specifically

Three external retrospectives converged on the same "scary-but-correct" list:

- `not_proven` reads alarming but is epistemically honest for first-publish
- Operators don't always know which file answers which question
- There's no live-triage guide for "is it still alive?" vs "did it stall?"

This pack closes all three in a single focused PR.

## Related

- #103 Narrate umbrella (operator legibility is the theme; see scout comment)
- #99 follow-ons: complements #117 (cargo-stdout-as-hint docs) merged earlier this session
- #93 events-as-truth (the INVARIANTS the cheat sheet references)
Code Review
This pull request significantly expands the project's documentation by adding a triage guide for stalled or interrupted runs, a reference cheat sheet for the .shipper/ state files, and an explanation of the finishability states used in preflight checks. The main README and documentation index have been updated to incorporate these new resources. Feedback was provided to improve the technical accuracy and formatting of the triage table in the stalled run guide, specifically to ensure that event outcomes mentioned in the documentation match the actual JSON output produced by the tool.
| Is the train alive or hung? | `events.jsonl` (latest entries) | `tail -n 20 .shipper/events.jsonl \| jq -c '.'` |
| What's the current crate? | `events.jsonl` (last `package_started`) | see below |
| How long has it been waiting? | `events.jsonl` (last `retry_backoff_started`) | see below |
| Which crates finished? | `events.jsonl` (published events) OR `state.json` | see below |
| What's next when I resume? | `state.json` (packages with `state.state == "pending"`) | see below |
| Why did it fail? | `events.jsonl` (last `package_failed` / `publish_reconciled.StillUnknown`) | see below |
The triage table would benefit from consistent use of backticks for filenames, event types, and commands to improve readability and distinguish technical terms from descriptive text. Additionally, `StillUnknown` should be changed to `still_unknown` to match the actual JSON output an operator will encounter in the event log (as shown in line 90).
| Is the train alive or hung? | `events.jsonl` (latest entries) | `tail -n 20 .shipper/events.jsonl \| jq -c '.'` |
| What's the current crate? | `events.jsonl` (last `package_started`) | see below |
| How long has it been waiting? | `events.jsonl` (last `retry_backoff_started`) | see below |
| Which crates finished? | `events.jsonl` (published events) OR `state.json` | see below |
| What's next when I resume? | `state.json` (packages with `state.state == "pending"`) | see below |
| Why did it fail? | `events.jsonl` (last `package_failed` / `still_unknown` outcome) | see below |
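The `events.jsonl` lookups in the triage rows ("last `package_started`", "last `package_failed` / `still_unknown` outcome") could be sketched in Python roughly as follows. This is a minimal illustration, not Shipper's own tooling: the `type`, `package`, and `outcome` keys and the sample lines are assumptions made for the example, so adjust them to the real event schema before relying on it.

```python
import json

# Hypothetical sample lines; real events.jsonl entries may use different keys.
SAMPLE_LINES = [
    '{"type": "package_started", "package": "crate-a"}',
    '{"type": "package_started", "package": "crate-b"}',
    '{"type": "publish_reconciled", "package": "crate-b", "outcome": "still_unknown"}',
    '{"type": "package_sta',  # truncated line, as if the writer were mid-append
]


def parse_events(lines):
    """Parse JSONL lines, skipping blank lines and lines that are not valid JSON."""
    events = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            events.append(json.loads(line))
        except json.JSONDecodeError:
            continue  # e.g. a half-written final line during a live run
    return events


def last_event(events, event_type, type_key="type"):
    """Return the most recent event whose `type_key` equals `event_type`, or None."""
    for event in reversed(events):
        if isinstance(event, dict) and event.get(type_key) == event_type:
            return event
    return None


events = parse_events(SAMPLE_LINES)
current = last_event(events, "package_started")
reconciled = last_event(events, "publish_reconciled")
print(current["package"] if current else None)
print(reconciled["outcome"] if reconciled else None)
```

To try it against a real run, the lines of `.shipper/events.jsonl` can be passed to `parse_events` in place of `SAMPLE_LINES`; skipping unparseable lines keeps the check usable while the file is still being appended to.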
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1484b84b1a
1. Download the `shipper-state-final` artifact from the cancelled run (or `shipper-state-preflight` / `shipper-state-plan` if later stages never ran).
2. Trigger the `release-resume` workflow_dispatch with `mode=resume` and `artifact_run_id=<cancelled-run-id>`.
Correct resume instructions for non-final artifacts
This workflow guidance says `release-resume` can continue from `shipper-state-preflight` / `shipper-state-plan`, but `.github/workflows/release.yml` hard-codes the download step to `name: shipper-state-final`. In a run that was cancelled or timed out before the final upload, following these steps will fail with artifact-not-found instead of resuming. Please either limit the instructions to `shipper-state-final` or document the required workflow change to select a different artifact name.
| What happened, in order? | `events.jsonl` |
| What's the current state (fast lookup)? | `state.json` |
| Did the whole release succeed, and what's the audit trail? | `receipt.json` |
| What would `shipper resume` skip? | `state.json` (packages with `state.state == "published"`) |
Include skipped packages in resume-skip guidance
This row says resume skips only packages where `state.state == "published"`, but the engine also skips packages already in the `Skipped` state. Documenting only `published` underreports what `shipper resume` will bypass and can mislead operators during incident triage. The condition here should include both `published` and `skipped`.
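The corrected skip condition could be checked against a `state.json` document with a short Python sketch like the one below. It follows the `.packages[].state.state` path named in the PR description. The `name` key, the lowercase `skipped` spelling, and the sample document are assumptions for illustration only, not the confirmed schema.

```python
# States that the review comment says `shipper resume` bypasses.
RESUME_SKIP_STATES = {"published", "skipped"}


def package_states(state_doc):
    """Yield (label, state) pairs read from the `.packages[].state.state` path."""
    packages = state_doc.get("packages", [])
    if isinstance(packages, dict):
        # jq's `.packages[]` also iterates object values, so accept a mapping too.
        entries = list(packages.items())
    else:
        # "name" is a hypothetical key; fall back to the list index as a label.
        entries = [(pkg.get("name", str(i)), pkg) for i, pkg in enumerate(packages)]
    for label, pkg in entries:
        yield label, pkg["state"]["state"]


def resume_would_skip(state_doc):
    """List the packages whose state is in RESUME_SKIP_STATES (case-insensitive)."""
    return [
        label
        for label, state in package_states(state_doc)
        if state.lower() in RESUME_SKIP_STATES
    ]


# Hypothetical sample shaped after the documented field path.
SAMPLE_STATE = {
    "packages": [
        {"name": "crate-a", "state": {"state": "published"}},
        {"name": "crate-b", "state": {"state": "skipped"}},
        {"name": "crate-c", "state": {"state": "pending"}},
    ]
}
print(resume_would_skip(SAMPLE_STATE))
```

Keeping the skip set in one named constant makes it harder for the docs and any helper script to drift apart on which states count as already handled.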
## "What will resume do?"

`shipper resume` reads `state.json`, validates the `plan_id` matches the current workspace, and continues from the first non-terminal package. Terminal states for resume: `Published`, `Skipped`. Non-terminal: `Pending`, `Failed`, `Ambiguous`.
Classify Uploaded as resumable in triage section
The state classification omits `Uploaded`, but interrupted runs can persist `state.state == "uploaded"`. Resume handles this as a distinct path (skip `cargo publish` and continue readiness/verification), so excluding it makes the triage checklist incomplete and can cause operators to misinterpret what resume will actually do.
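A classification that accounts for `Uploaded` might look like the Python sketch below. It only encodes what this thread states: `Published` and `Skipped` are terminal; `Pending`, `Failed`, and `Ambiguous` are non-terminal; `Uploaded` is resumable via a separate path. The category labels and function names are hypothetical, and treating unrecognized states as resumable is an assumption chosen to err on the side of caution.

```python
TERMINAL = {"published", "skipped"}
NON_TERMINAL = {"pending", "failed", "ambiguous"}
UPLOADED = "uploaded"


def classify_for_resume(state):
    """Map a package state string to a triage category (labels are illustrative)."""
    s = state.lower()
    if s in TERMINAL:
        return "terminal"
    if s == UPLOADED:
        # Per the review: resume skips `cargo publish` and continues verification.
        return "resumable_after_upload"
    if s in NON_TERMINAL:
        return "non_terminal"
    return "unknown"


def first_resumable_index(states):
    """Return the index of the first state that is not terminal, or None."""
    for i, state in enumerate(states):
        if classify_for_resume(state) != "terminal":
            return i
    return None


print(classify_for_resume("Uploaded"))
print(first_resumable_index(["published", "skipped", "uploaded", "pending"]))
```

With `Uploaded` given its own category, a triage script can tell an operator that the package will be picked up on resume without being re-published.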
Codecov Report: All modified and coverable lines are covered by tests.
Closes the "scary-but-correct UX" papercuts flagged across multiple external retrospectives. Three new docs covering operator-legibility gaps where Shipper's behaviour is correct but the operator's mental model wasn't supported.
New docs (Diátaxis)
Index updates
Why these three specifically
Three external retrospectives converged on the same list:
The existing `inspect-state-and-receipts.md` covers post-hoc inspection. The new `inspect-a-stalled-run.md` differs in scope: it covers live triage while the run is still in progress.
Scope
Pure docs. No code, no schema, no snapshot churn. 5 files changed, +334 / -4.
Related
Verification