docs: operator-trust pack — not_proven explainer + stalled-run triage + state-files cheat sheet#118

Merged: EffortlessSteven merged 1 commit into main from docs/scary-but-correct-operator-pack on Apr 17, 2026.

@EffortlessSteven (Member)

Closes the "scary-but-correct UX" papercuts flagged across multiple external retrospectives. Adds three new docs covering operator-legibility gaps where Shipper's behavior is correct but the docs did not support the operator's mental model.

New docs (Diátaxis)

| File | Quadrant | Purpose |
|---|---|---|
| `docs/explanation/finishability.md` | Explanation | Why `finishability = not_proven` is honest-not-alarming on first publish |
| `docs/how-to/inspect-a-stalled-run.md` | How-to | Live triage: "is the train alive?" 30-second check, question-to-file map, jq recipes |
| `docs/reference/state-files.md` | Reference | One-page cheat sheet: authority order, per-file fields, field-path caveats, jq one-liners |

Index updates

  • `docs/README.md` (Diátaxis index): new entries under their quadrants
  • Root `README.md`: extended quick-links strip

Why these three specifically

Three external retrospectives converged on the same list:

  • `not_proven` reads alarming but is epistemically honest for first-publish (ownership can't be verified on a crate that doesn't exist yet)
  • Operators don't always know which file (`events.jsonl` / `state.json` / `receipt.json`) answers which question
  • There's no live-triage guide for "is it still alive?" vs "did it stall?"

The existing `inspect-state-and-receipts.md` covers post-hoc inspection; the new `inspect-a-stalled-run.md` covers live triage while the run is still in progress.

Scope

Pure docs. No code, no schema, no snapshot churn. 5 files changed, +334 / -4.

Related

Verification

… + state-files cheat sheet

Closes the "scary-but-correct UX" papercuts flagged across multiple
external retrospectives. Adds three new docs covering operator-legibility
gaps where Shipper's behavior is correct but the docs did not support
the operator's mental model.

## New docs (Diátaxis)

- **`docs/explanation/finishability.md`** — why `finishability = not_proven`
  is the correct answer on first publish, not a danger signal. Maps each case
  (first publish, token lacks ownership, network flake) to concrete
  operator action. Notes the future rehearsal-registry path (#97) that
  will promote more NotProven cases to Proven.

- **`docs/how-to/inspect-a-stalled-run.md`** — live triage. "Is the train
  alive or hung?" 30-second check using events.jsonl tail. Maps common
  questions (current crate, how long waiting, what will resume do, why
  did it fail) to the authoritative file + jq recipe. Distinguished from
  the existing `inspect-state-and-receipts.md` which covers post-hoc
  "what happened."

- **`docs/reference/state-files.md`** — one-page cheat sheet. Authority
  order (events > state > receipt), per-file purpose table, key field
  paths (including the `.packages[].state.state` nesting caveat —
  common misread), jq one-liners for the most frequent queries, sidecar
  files reference.
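
The `.packages[].state.state` nesting caveat mentioned above can be illustrated with a minimal Python sketch. Only that field path comes from the cheat sheet; the sample document, the `plan_id` value, and the package keys are hypothetical, and the real `state.json` layout may differ.

```python
import json

# Hypothetical state.json fragment: only the `.packages[].state.state`
# nesting comes from the cheat sheet; every other key is an assumption.
SAMPLE_STATE = json.loads("""
{
  "plan_id": "example-plan",
  "packages": {
    "demo-core@0.1.0": {"state": {"state": "published"}},
    "demo-cli@0.1.0": {"state": {"state": "pending"}}
  }
}
""")


def package_states(state_doc):
    """Map each package key to its inner `state.state` string."""
    packages = state_doc["packages"]
    # jq's `.packages[]` iterates arrays and objects alike; mirror that here.
    items = packages.items() if isinstance(packages, dict) else enumerate(packages)
    return {key: entry["state"]["state"] for key, entry in items}


states = package_states(SAMPLE_STATE)
print(states)
```

The misread the cheat sheet warns about corresponds to stopping one level early: in this sketch `entry["state"]` is an object, and the status string sits one level deeper at `entry["state"]["state"]`.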

## Navigation updates

- `docs/README.md` (Diátaxis index): add all three new entries under
  their correct quadrants
- Root `README.md`: extend the quick-links strip to include stalled-run
  triage, state-files cheat sheet, and the not_proven explainer

## Scope

Pure docs. No code, no schema, no snapshot churn. ~450 lines of new
content in 3 files + 2 index updates.

## Why these three specifically

Three external retrospectives converged on the same "scary-but-correct"
list:
- `not_proven` reads alarming but is epistemically honest for first-publish
- Operators don't always know which file answers which question
- There's no live-triage guide for "is it still alive?" vs "did it stall?"

This pack closes all three in a single focused PR.

## Related

- #103 Narrate umbrella (operator legibility is the theme; see scout comment)
- #99 follow-ons — complements #117 (cargo-stdout-as-hint docs) merged earlier this session
- #93 events-as-truth (the INVARIANTS the cheat sheet references)
@coderabbitai

coderabbitai Bot commented Apr 17, 2026

Warning

Rate limit exceeded

@EffortlessSteven has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 12 minutes and 42 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 12 minutes and 42 seconds.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 07a285e6-4f84-491c-97c0-eefd5ad86cfa

📥 Commits

Reviewing files that changed from the base of the PR and between 0a6aa24 and 1484b84.

📒 Files selected for processing (5)
  • README.md
  • docs/README.md
  • docs/explanation/finishability.md
  • docs/how-to/inspect-a-stalled-run.md
  • docs/reference/state-files.md

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request significantly expands the project's documentation by adding a triage guide for stalled or interrupted runs, a reference cheat sheet for the .shipper/ state files, and an explanation of the finishability states used in preflight checks. The main README and documentation index have been updated to incorporate these new resources. Feedback was provided to improve the technical accuracy and formatting of the triage table in the stalled run guide, specifically to ensure that event outcomes mentioned in the documentation match the actual JSON output produced by the tool.

Comment on lines +11 to +16
| Is the train alive or hung? | `events.jsonl` (latest entries) | `tail -n 20 .shipper/events.jsonl \| jq -c '.'` |
| What's the current crate? | `events.jsonl` (last `package_started`) | see below |
| How long has it been waiting? | `events.jsonl` (last `retry_backoff_started`) | see below |
| Which crates finished? | `events.jsonl` (published events) OR `state.json` | see below |
| What's next when I resume? | `state.json` (packages with `state.state == "pending"`) | see below |
| Why did it fail? | `events.jsonl` (last `package_failed` / `publish_reconciled.StillUnknown`) | see below |

Priority: medium

The triage table would benefit from consistent use of backticks for filenames, event types, and commands, which improves readability and distinguishes technical terms from descriptive text. Additionally, `StillUnknown` should be changed to `still_unknown` to match the actual JSON output an operator will encounter in the event log (as shown in line 90).

Suggested change (replacement for the six rows quoted above):
| Is the train alive or hung? | `events.jsonl` (latest entries) | `tail -n 20 .shipper/events.jsonl \| jq -c '.'` |
| What's the current crate? | `events.jsonl` (last `package_started`) | see below |
| How long has it been waiting? | `events.jsonl` (last `retry_backoff_started`) | see below |
| Which crates finished? | `events.jsonl` (published events) OR `state.json` | see below |
| What's next when I resume? | `state.json` (packages with `state.state == "pending"`) | see below |
| Why did it fail? | `events.jsonl` (last `package_failed` / `still_unknown` outcome) | see below |
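
The "alive or hung?" row of the triage table can be sketched in Python as a check on the age of the newest event. This is an illustration only: the `timestamp` and `type` field names are assumptions about the `events.jsonl` schema, and the sample lines are invented; only the event type names `package_started` and `retry_backoff_started` appear in the table.

```python
import json
from datetime import datetime, timezone


def last_event_age_seconds(lines, now):
    """Return (event, age in seconds) for the newest non-empty event line."""
    for line in reversed(lines):
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        # Assumed field name and ISO 8601 format; adjust to the real schema.
        stamp = datetime.fromisoformat(event["timestamp"])
        return event, (now - stamp).total_seconds()
    return None, None


# Hypothetical tail of .shipper/events.jsonl.
SAMPLE_LINES = [
    '{"timestamp": "2026-04-17T09:00:00+00:00", "type": "package_started"}',
    '{"timestamp": "2026-04-17T09:00:30+00:00", "type": "retry_backoff_started"}',
]

now = datetime(2026, 4, 17, 9, 1, 0, tzinfo=timezone.utc)
event, age = last_event_age_seconds(SAMPLE_LINES, now)
print(event["type"], age)
```

A recent `retry_backoff_started` event suggests the run is waiting rather than hung. What counts as "too old" depends on the configured backoff, so the sketch reports the age and leaves the threshold to the operator.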

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1484b84b1a

Comment on lines +96 to +97
1. Download the `shipper-state-final` artifact from the cancelled run (or `shipper-state-preflight` / `shipper-state-plan` if later stages never ran).
2. Trigger the `release-resume` workflow_dispatch with `mode=resume` and `artifact_run_id=<cancelled-run-id>`.

P1: Correct resume instructions for non-final artifacts

This workflow guidance says `release-resume` can continue from `shipper-state-preflight` or `shipper-state-plan`, but `.github/workflows/release.yml` hard-codes the download step to `name: shipper-state-final`. In a cancelled or timed-out run that ends before the final upload, following these steps will fail with artifact-not-found instead of resuming. Please either limit the instructions to `shipper-state-final` or document the workflow change required to select a different artifact name.

| What happened, in order? | `events.jsonl` |
| What's the current state (fast lookup)? | `state.json` |
| Did the whole release succeed, and what's the audit trail? | `receipt.json` |
| What would `shipper resume` skip? | `state.json` (packages with `state.state == "published"`) |

P2: Include skipped packages in resume-skip guidance

This row says resume skips only packages where `state.state == "published"`, but the engine also skips packages already in the `Skipped` state. Documenting only `published` underreports what `shipper resume` will bypass and can mislead operators during incident triage. The condition here should include both `published` and `skipped`.
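
The corrected condition can be sketched in Python as a filter over both states. The `.packages[].state.state` path and the two state names come from this thread; the sample document and package keys are hypothetical, and the real `state.json` may carry more fields.

```python
# States that, per the review comment, `shipper resume` bypasses.
SKIPPED_BY_RESUME = {"published", "skipped"}

# Hypothetical state.json content; only `.packages[].state.state` is taken
# from the docs, and the package keys are made up.
state_doc = {
    "packages": {
        "demo-core@0.1.0": {"state": {"state": "published"}},
        "demo-docs@0.1.0": {"state": {"state": "skipped"}},
        "demo-cli@0.1.0": {"state": {"state": "pending"}},
    }
}


def resume_would_skip(doc):
    """List package keys whose nested state is published or skipped."""
    return sorted(
        key
        for key, entry in doc["packages"].items()
        if entry["state"]["state"] in SKIPPED_BY_RESUME
    )


print(resume_would_skip(state_doc))
```

Filtering on `published` alone would leave `demo-docs@0.1.0` out of the result, which is the underreporting the comment describes.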


## "What will resume do?"

`shipper resume` reads `state.json`, validates the `plan_id` matches the current workspace, and continues from the first non-terminal package. Terminal states for resume: `Published`, `Skipped`. Non-terminal: `Pending`, `Failed`, `Ambiguous`.

P2: Classify Uploaded as resumable in triage section

The state classification omits `Uploaded`, but interrupted runs can persist `state.state == "uploaded"`. Resume handles this as a distinct path (skip `cargo publish` and continue readiness/verification), so excluding it leaves the triage checklist incomplete and can cause operators to misinterpret what resume will actually do.
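
One way to picture the fuller classification is a lookup from persisted state to what resume does with it. This is a sketch built from the statements in this thread, not Shipper's engine code: the `ResumeAction` names are invented for illustration, and the lowercase spellings of `failed` and `ambiguous` are assumed to follow the `pending`/`published` style.

```python
from enum import Enum


class ResumeAction(Enum):
    SKIP = "terminal: resume passes over it"
    VERIFY_ONLY = "skip cargo publish, continue readiness/verification"
    ATTEMPT = "non-terminal: resume picks it up"


# State names follow the thread; the lowercase spellings of "failed" and
# "ambiguous" are assumptions.
RESUME_ACTIONS = {
    "published": ResumeAction.SKIP,
    "skipped": ResumeAction.SKIP,
    "uploaded": ResumeAction.VERIFY_ONLY,
    "pending": ResumeAction.ATTEMPT,
    "failed": ResumeAction.ATTEMPT,
    "ambiguous": ResumeAction.ATTEMPT,
}


def first_resumable(ordered_states):
    """Return (package, action) for the first package resume would not skip."""
    for package, state in ordered_states:
        action = RESUME_ACTIONS[state]
        if action is not ResumeAction.SKIP:
            return package, action
    return None


# Hypothetical plan order; package names are illustrative only.
plan = [("demo-core", "published"), ("demo-macros", "uploaded"), ("demo-cli", "pending")]
print(first_resumable(plan))
```

Keeping `uploaded` as its own action rather than folding it into the non-terminal group makes the distinct path visible: the package is picked up again, but without a second `cargo publish`.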

EffortlessSteven merged commit 0589ba4 into main on Apr 17, 2026 (19 checks passed).
EffortlessSteven deleted the docs/scary-but-correct-operator-pack branch on April 17, 2026 09:55.
@codecov

codecov Bot commented Apr 17, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
