docs(ci): audit and rewrite test-evidence-lanes (#189, docs-first) #234

Merged
EffortlessSteven merged 1 commit into main from docs/ci-lane-policy-audit-189 on May 12, 2026

@EffortlessSteven

Summary

Docs-first PR for #189. Captures the actual current state of .github/workflows/ so future routing decisions have an accurate baseline. No workflow files changed.

Why aggressive routing is deferred

#189 was filed with a richer lane-routing proposal (e.g. moving cross-platform from every-PR to labeled+nightly). That work is deferred because:

  • Current PR wall-clock is ~24–28 min on recent runs (close to the original 25-min target without aggressive routing).
  • Shipper has Windows-specific code paths — moving Windows/macOS/aarch64 builds off the PR lane means platform regressions surface a day later (we hit a Windows-specific failure mode this session).
  • Coverage-removing routing changes are hard to undo. Better to do them as their own focused PRs.

What the doc now captures (was missing/stale)

  • Complete workflow inventory: 10 workflows, ~28 jobs (issue body claimed 17 / 5).
  • Per-job lane map for ci.yml: predicates, wall-clock, what each job proves. The previous doc omitted fuzz-smoke, cross-platform, release-build, crypto-proptests-heavy, and policy, and split the single lint job into two rows.
  • All eight xtask policy checks in the policy-gates section (with the PR each was introduced in).
  • Droid workflows in the advisory section (was missing).
  • Mutation schedule corrected to "Sunday 04:00 UTC" (was "nightly").
  • release.yml job table including the policy-gate job from #229 (feat(harden): no-panic check + release CI gate, #187 PR 8b).
  • "Routing Changes Deferred to Follow-Up PRs" section enumerating concrete movements with rationale: release-build to release-only, path-filtered fuzz-smoke, split cross-platform so only Linux is every-PR.

Test plan

  • No code changes; doc-only PR.
  • cargo fmt --all -- --check (unchanged source) clean.
  • The "Workflow Inventory" table cross-checked against ls .github/workflows/*.yml and each file's on: block.
  • The ci.yml job table cross-checked against the actual job names in the workflow file.
  • CI green.

Refs #189.

#189 was filed with a richer lane-routing proposal (e.g. moving
cross-platform from every-PR to labeled+nightly). That work is deferred
because:

  - Current PR wall-clock is ~24-28 min (recent CI runs from this
    rollout), close to the original 25-min target without aggressive
    routing.
  - Shipper has Windows-specific code paths (path handling, process
    spawning, line endings). Moving Windows/macOS/aarch64 builds to
    nightly means platform regressions surface a day later instead of
    inside the PR — that is a real loss for a release-pipeline product.
  - Coverage-removing routing changes are hard to undo: contributors
    stop expecting platform signal, regressions accumulate, then
    "fixing" the lanes becomes a multi-PR cleanup.

So #189 lands as a documentation pass: capture the actual current state
of .github/workflows/ so future routing decisions have an accurate
baseline.

Changes to docs/ci/test-evidence-lanes.md:

  - Replace the conflated "Always-On" table with a complete workflow
    inventory: 10 workflows, ~28 jobs across them, columns for trigger
    / lane / required-for-merge.

  - Per-job lane map for ci.yml (the load-bearing PR workflow), with
    predicates, observed wall-clock, and what each job proves. Previous
    doc was missing fuzz-smoke, cross-platform, release-build,
    crypto-proptests-heavy, and policy entirely; split the single `lint`
    job into separate `fmt`/`clippy` rows.

  - Policy gates section now lists all eight xtask checks (file-policy,
    generated, executable, dependency-surface, workflow, process,
    network, lint-policy) plus the release-time no-panic check, with
    the PR each was introduced in.

  - Advisory/Routed section adds droid-review and droid (was missing).

  - Scheduled section corrects mutation to "Sunday 04:00 UTC" (was
    "nightly") and adds droid-security-scan.

  - Release Proof: full release.yml job table including the
    policy-gate job added in #229.

  - New "Routing Changes Deferred to Follow-Up PRs" section enumerates
    concrete movements with rationale: release-build to release-only,
    path-filtered fuzz-smoke, split cross-platform so only Linux is
    every-PR.

No workflow files changed. Coverage-removing routing decisions deserve
their own focused PRs.

  CHANGELOG.md                   Added entry under Documentation
  docs/ci/test-evidence-lanes.md ~180 lines rewritten
@gemini-code-assist

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@coderabbitai

coderabbitai Bot commented May 12, 2026

Warning

Rate limit exceeded

@EffortlessSteven has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 18 minutes and 40 seconds before requesting another review.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 74d49224-b497-4019-abca-75b07e88dee6

📥 Commits

Reviewing files that changed from the base of the PR and between 9cbde66 and 0da95b3.

📒 Files selected for processing (2)
  • CHANGELOG.md
  • docs/ci/test-evidence-lanes.md

@factory-droid

factory-droid Bot commented May 12, 2026

Droid finished @EffortlessSteven's task.


Droid review complete. Validation Pass 2 finished.

Results: 6/6 candidates approved. All findings cross-checked against actual workflow files.

  • 2 P1s confirmed: architecture-guard.yml path filter omitted from Workflow Inventory; ci.yml fuzz-smoke row lists six wrong target names (actual: five targets, namely load_state, resolve_token, schema_version, release_levels, redact_output).
  • 2 P2s confirmed: coverage.yml and ripr.yml path filters absent from trigger descriptions.
  • 2 P3s confirmed: bdd/release-build needs: dependencies not reflected in predicate column; shipper-registry absent from mutation.yml weekly targets but listed in Trust-Critical Crates table.

No security issues found. Batched review submitted.

## Workflow Inventory

### Always-On (Every PR and Push)
Every workflow under `.github/workflows/` and the lane each one occupies. The audit dates from the #189 docs pass.

[P1] Workflow Inventory — architecture-guard.yml trigger omits path filter

The doc claims architecture-guard.yml triggers on push + pull_request without path restrictions. The actual file (architecture-guard.yml lines 3-12) restricts both triggers to only fire on path changes under crates/shipper-core/src/** or crates/shipper/src/**. A push to main that only touches documentation or CI configs does NOT trigger this workflow. The doc's Trigger column should reflect this path filtering, e.g.: push (main, path-filtered to crates/shipper-core/src/** + crates/shipper/src/**) + pull_request (path-filtered).
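The trigger rule described in this finding can be illustrated with a minimal Rust sketch. It assumes a simplified reading of path filtering: a workflow fires when at least one changed file matches at least one filter, and only filters ending in `/**` (or exact paths) are handled. The function names and the `ARCHITECTURE_GUARD_PATHS` constant are hypothetical; the two filter strings are the ones quoted in the finding. This is not GitHub's actual glob matcher.

```rust
/// Filters quoted in the review finding for architecture-guard.yml.
/// (The constant name is illustrative, not taken from the repository.)
const ARCHITECTURE_GUARD_PATHS: [&str; 2] =
    ["crates/shipper-core/src/**", "crates/shipper/src/**"];

/// Returns true when `path` falls under `filter`.
/// Simplification: only a trailing `/**` (anything below a directory)
/// and exact file paths are supported.
fn matches_filter(path: &str, filter: &str) -> bool {
    match filter.strip_suffix("/**") {
        // `dir/**` matches any file strictly below `dir/`.
        Some(dir) => path
            .strip_prefix(dir)
            .map_or(false, |rest| rest.starts_with('/')),
        // No wildcard: require an exact match.
        None => path == filter,
    }
}

/// A path-filtered trigger fires when at least one changed file
/// matches at least one filter.
fn workflow_triggers(changed: &[&str], filters: &[&str]) -> bool {
    changed
        .iter()
        .any(|path| filters.iter().any(|filter| matches_filter(path, filter)))
}
```

Under these assumptions, a change set containing only `docs/ci/test-evidence-lanes.md` and `CHANGELOG.md` would not trigger the workflow, while one touching `crates/shipper/src/lib.rs` would.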

| `security` | every PR | ~1 min | `cargo audit` against the current advisory database. |
| `docs` | every PR | ~1 min | `cargo doc --workspace --no-deps` clean under `-D warnings` (catches `rustdoc::invalid-html-tags` and friends). |
| `bdd` | every PR | ~2 min | Cucumber scenarios for core publish/resume/reconcile flows. |
| `fuzz-smoke` | every PR except `schedule` | ~10 min | Six fuzz targets at low-energy: parser, encrypt, sanitizer, plan, state, events. |

[P1] ci.yml job table — fuzz-smoke target count and names are inaccurate

Line 46 describes six fuzz targets as: parser, encrypt, sanitizer, plan, state, events. The actual ci.yml fuzz-smoke job (ci.yml lines ~347-351) runs exactly five cargo-fuzz commands: load_state, resolve_token, schema_version, release_levels, redact_output. The doc's "six" count is wrong and none of the listed names (parser, encrypt, sanitizer, plan, state, events) match the actual targets. The six names in the doc appear to be conflated with the separate fuzz.yml nightly targets (load_state, resolve_token, encrypt_decrypt, retry_strategy, types_serialization, redact_output).

| `droid-review.yml` | `pull_request` | Advisory (same-repo + bot guard) | Advisory |
| `droid.yml` | `issues` + `pull_request` (command-triggered) | Advisory (trusted-actor guard) | Advisory |
| `droid-security-scan.yml` | `schedule` + `workflow_dispatch` | Scheduled (Mon 08:00 UTC) | Advisory |
| `fuzz.yml` | `schedule` + `workflow_dispatch` | Nightly | Advisory |

[P2] Workflow Inventory — coverage.yml trigger omits path filter

The doc lists coverage.yml trigger as push (main) without path restrictions. The actual coverage.yml (lines 4-11) has push + pull_request triggers with an explicit path filter: only paths matching Cargo.toml, Cargo.lock, crates/**, fuzz/**, .github/workflows/coverage.yml, or codecov.yml. The doc omits this, making it appear coverage runs on all main pushes when it does not.

|---|---|---|---|
| `lint` | every PR | ~1 min | `cargo fmt --check` + `cargo clippy --workspace --all-targets -- -D warnings`. |
| `policy` | every PR | ~1 min | All seven xtask policy checks in `--mode blocking-allowlist`, plus `policy-report`. See `docs/policy/NON_RUST_ROLLOUT.md`. |
| `test` (nextest, 3-OS matrix) | every PR | ~17 min (longest leg) | Unit and integration tests pass on Ubuntu, Windows, macOS. Doc-tests run alongside. |

[P2] Workflow Inventory — ripr.yml trigger omits path filter

The doc lists ripr.yml trigger as pull_request + workflow_dispatch. The actual ripr.yml (lines 4-11) restricts pull_request to only fire on paths matching crates/**, xtask/**, Cargo.{toml,lock}, ripr.toml, policy/ripr-suppressions.toml, or .github/workflows/ripr.yml. A docs-only PR does not trigger ripr. The path filter should be documented.

**Recent observed PR wall-clock:** 24–28 min (CI runs from this rollout). Critical path is `test` on macOS plus `fuzz-smoke`; everything else fits inside that window.

## Policy Gates (xtask-Enforced, Inside `ci.yml`'s `policy` Job)

[P3] ci.yml job table — bdd and release-build predicates understate their dependencies

The doc says bdd and release-build have predicate 'every PR'. In the actual ci.yml, bdd has needs: [lint] and release-build has needs: [lint, test]. These jobs only run when their dependencies succeed. While technically still 'every PR' in the sense they run on every PR that passes lint, the simple 'every PR' framing without dependency context could mislead a contributor expecting them to run independently.

@@ -130,3 +162,19 @@ These crates receive the most rigorous mutation coverage because they handle rea
| `shipper-registry` | Registry API interactions |

[P3] Trust-Critical Crates table — shipper-registry listed but excluded from mutation coverage

The Trust-Critical Crates table includes shipper-registry, correctly noting it handles 'Registry API interactions'. However, mutation.yml's mutants-weekly job only mutates shipper-duration, shipper-types, and shipper-config — shipper-registry is absent. The doc's 'Expanding to full trust-critical surface is a future rollout step' comment is accurate but easy to miss. A reader could reasonably conclude shipper-registry is covered by the weekly mutation lane when it is not.

EffortlessSteven merged commit 16977f7 into main on May 12, 2026
30 checks passed
EffortlessSteven deleted the docs/ci-lane-policy-audit-189 branch on May 12, 2026 21:22
EffortlessSteven added a commit that referenced this pull request on May 12, 2026
…) (#237)

Stabilises `concurrent_version_exists_checks` (and any other test using
`with_multi_server`) on slow macOS CI runners. Hit three times in a
single rollout session (#233, #234, #236) as `version_exists: registry
request failed -> operation timed out` against the local tiny_http
mock.

Root cause: the helper's accept loop blocked on `handler(req)` until
each response was fully written before returning to `recv_timeout`.
With 5 concurrent reqwest clients hitting the same loopback socket,
the remaining clients sat in the kernel's TCP backlog long enough to
exceed reqwest's default OS-level connect timeout. Windows and Linux
runners process the queue fast enough to mask the bug; macOS does
not.

Fix: spawn one worker thread per accepted request and let the accept
loop return to `recv_timeout` immediately. The accept loop still
serialises on `recv_timeout` (tiny_http requires that), but handlers
run in parallel, so the kernel's listen queue drains as fast as
connections arrive.

Other changes:
  - `recv_timeout` bumped from 30s to 60s for additional headroom.
  - Trait bound on the handler closure goes from `Fn + Send + 'static`
    to `Fn + Send + Sync + 'static` (required to wrap the handler in
    `Arc` for clone-into-workers). All existing call sites use
    closures that already satisfy `Sync`.
  - The accept thread joins worker threads before returning so any
    panic in a handler surfaces in CI.

cargo test -p shipper-registry --lib passes 258/258 locally (Windows).
The fix targets a macOS-specific timing bug, so CI is the real
verification.
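
The shape of the fix described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the helper's actual code: a `std::sync::mpsc::Receiver` stands in for the tiny_http server (both expose a `recv_timeout` call), and the function name `serve_concurrently` and its `expected` parameter are hypothetical.

```rust
use std::sync::mpsc::Receiver;
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// Accept up to `expected` requests, handing each to its own worker thread.
/// `incoming` is a stand-in for the mock server's request queue.
fn serve_concurrently<R, H>(
    incoming: Receiver<R>,
    expected: usize,
    timeout: Duration,
    handler: H,
) where
    R: Send + 'static,
    // `Sync` is needed because the handler is shared across workers via `Arc`.
    H: Fn(R) + Send + Sync + 'static,
{
    let handler = Arc::new(handler);
    let mut workers = Vec::with_capacity(expected);

    for _ in 0..expected {
        // The accept loop only waits for the next request; it never waits
        // for a handler to finish writing its response.
        match incoming.recv_timeout(timeout) {
            Ok(req) => {
                let handler = Arc::clone(&handler);
                workers.push(thread::spawn(move || (*handler)(req)));
            }
            // Timed out or the sender went away: stop accepting.
            Err(_) => break,
        }
    }

    // Join every worker so a panic inside a handler is not silently lost.
    for worker in workers {
        worker.join().expect("handler panicked");
    }
}
```

If each handler blocks until all of its peers have started (for example on a shared `Barrier`), this loop still completes, whereas a loop that calls the handler inline before returning to `recv_timeout` would stall on the first request.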