
chore(ci): self-heal cargo registry cache against intel-runner race #1025

Closed
noahgift wants to merge 2 commits into main from chore/ci-registry-self-heal


@noahgift

Contributor

Summary

Root cause

Intel runners bind-mount /home/noah/.cargo/registry into every workspace-test container (ci.yml line 49), so 16 clean-room runners write to the same directory concurrently. When one job partially removes a crate's extracted src/ while the crate's .crate tarball remains in registry/cache/, cargo treats the crate as cached and the build fails with `couldn't read .../lib.rs`.

This is the same class of bug as the target/ race that disk-guard PR #1001 mitigated — just a different shared directory.

Fix

  • Idempotent: no-op when the cache is consistent.
  • Targeted: only removes entries where one of the two cargo-trust markers (.cargo-ok or Cargo.toml) is missing. Cargo's extractor writes .cargo-ok last, so a missing .cargo-ok means the extraction either never finished or was partially removed afterwards.
  • Fast: walks ~4k dirs via stat(2), sub-second on the intel host.
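
A minimal sketch of the marker check described above, in Python for illustration only. The actual CI step is not shown on this page; the function name `find_broken_extractions` is hypothetical, and the sketch assumes the usual `registry/src/<index>/<crate>-<version>/` layout.

```python
from pathlib import Path

# Files the self-heal check expects in every complete extraction.
MARKERS = (".cargo-ok", "Cargo.toml")


def find_broken_extractions(registry_src: Path) -> list[Path]:
    """Return crate dirs under registry_src/<index>/ that lack a marker file."""
    broken = []
    for crate_dir in sorted(registry_src.glob("*/*")):
        if not crate_dir.is_dir():
            continue
        # One missing marker is enough to treat the extraction as broken.
        if any(not (crate_dir / name).is_file() for name in MARKERS):
            broken.append(crate_dir)
    return broken
```

On a consistent cache the returned list is empty, which is what makes the step a no-op. Removing the returned directories is left to the caller, so the same function can also serve as a dry run.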

Test plan

🤖 Generated with Claude Code

Intel runners bind-mount /home/noah/.cargo/registry into every
workspace-test container. 16 clean-room runners write the SAME
directory simultaneously, and a crate's src/ extraction can be
partially removed by a concurrent job while its .crate tarball
stays intact — so cargo thinks the crate is cached but fails with
`couldn't read .../lib.rs`.

Failure class: PR #987 saw five consecutive workspace-test failures on
2026-04-23, all on a rustix-0.38.44 cache miss. This is the same class of
bug as the target/ race that disk-guard PR #1001 addressed for a different
directory.

Fix: before cargo compiles, walk /usr/local/cargo/registry/src/*/*/ (the
container-side path of the bind-mounted host registry) and remove any
extraction missing `.cargo-ok` or `Cargo.toml`. Cargo re-extracts from the
.crate tarball on next use (cheap; no network). Idempotent: a no-op when
the cache is already consistent.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift enabled auto-merge (squash) April 23, 2026 11:55
The original heuristic (missing .cargo-ok or Cargo.toml) missed a
narrower failure mode observed on PR #1025's own workspace-test run:
several crate extractions retained both marker files at the root, but
the `src/` directory itself was gone. Cargo trusted the cache and
failed on `couldn't read .../src/lib.rs`.

New check: after the marker/manifest pair passes, also require at least
one of src/lib.rs | src/main.rs | lib.rs | main.rs to exist. Flat-layout
crates (fnv-1.0.7, macro_rules_attribute-proc_macro-0.2.2) keep lib.rs
at the crate root with no src/ dir — the disjunction avoids a false
positive on that entire class.
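
A minimal Python sketch of the extended check, for illustration only. The function name `extraction_looks_complete` is hypothetical, and the real CI step may be written differently.

```python
from pathlib import Path

# Candidate entry points: src/ layout first, then flat layout at the crate root.
ENTRY_POINTS = ("src/lib.rs", "src/main.rs", "lib.rs", "main.rs")


def extraction_looks_complete(crate_dir: Path) -> bool:
    """Marker and manifest present, plus at least one entry-point file."""
    if not (crate_dir / ".cargo-ok").is_file():
        return False
    if not (crate_dir / "Cargo.toml").is_file():
        return False
    # The disjunction keeps flat-layout crates from being flagged.
    return any((crate_dir / rel).is_file() for rel in ENTRY_POINTS)
```

Note that a crate whose manifest points its library at some other path would be flagged by this sketch. Under the approach described here that only costs a re-extraction from the tarball.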

Also fall back to `sudo rm -rf` when the initial rm fails on root-owned
artifacts left by container runs (runner user is noah:1000, containers
write as root via bind-mount).
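
The fallback could look roughly like the following Python sketch. It assumes passwordless sudo for the runner user; the function name `remove_extraction` is hypothetical and not taken from the workflow.

```python
import shutil
import subprocess
from pathlib import Path


def remove_extraction(crate_dir: Path) -> None:
    """Remove crate_dir, escalating to sudo only if the plain removal fails."""
    try:
        shutil.rmtree(crate_dir)
    except FileNotFoundError:
        return  # Another job already removed it; nothing left to do.
    except PermissionError:
        # Root-owned files left by a container run. `sudo -n` fails instead
        # of prompting, so a misconfigured host cannot hang the job.
        subprocess.run(
            ["sudo", "-n", "rm", "-rf", "--", str(crate_dir)],
            check=True,
        )
```

Trying the unprivileged removal first keeps sudo out of the common path, so it only runs for the root-owned leftovers.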

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request Apr 24, 2026
… race (ANDON paiml/infra#77) (#1043)

* fix(ci): per-PR cargo registry to break intel-runner concurrent-write race (paiml/infra#77)

ANDON 2026-04-24 — aprender 11-PR stack (#1031..#1042) all failing `ci / security`
and `workspace-test` with:

  error: couldn't read /home/noah/.cargo/registry/src/<crate>/lib.rs:
         Permission denied (os error 13)

and the rustix-0.38 equivalent (E0432 unresolved import `libc`/`libc_errno`
originating in the `syscall` macro, which the rustix build.rs regenerates from
src/ files — missing src/ → macro can't find libc crate → cascading errors).

FIVE WHYS
─────────
 1 `ci / security` fails: `cargo install cargo-audit --locked` hits EACCES
   reading `fnv-1.0.7/lib.rs`.
 2 EACCES: the file is missing OR owned by root (docker container creates
   extractions as root on the bind-mounted host registry).
 3 Concurrent writers: 16 self-hosted `intel-clean-room-*` runners bind-mount
   the SAME /home/noah/.cargo/registry — cargo extractions, the ci-reaper
   TTL sweep, and cross-container chown cycles all touch identical paths.
 4 Shared by design: ci.yml:49 was authored for throughput — re-downloading
   crates per job is ~200MB, so the host registry was shared across all
   runners. Race class not modeled.
 5 Precedent already exists: target/ hit the identical race under concurrent
   PRs (task #134) and was fixed by per-PR isolation on
   /mnt/nvme-raid0/targets/aprender-ci/<pr#>. The registry simply never got
   the same treatment.

ROOT CAUSE
──────────
Shared mutable bind mount + concurrent multi-runner write access ≈ guaranteed
race. The existing band-aid (PR #1025 "self-heal cargo registry cache",
.cargo-ok + Cargo.toml marker check) only runs inside `ci / security` and
itself races with concurrent jobs that have already passed the cache check.
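
The check-then-use gap can be replayed deterministically in one process. This Python sketch is illustrative only: the two "jobs" are sequential calls, and the helper names are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def cache_check_passes(crate_dir: Path) -> bool:
    """Stand-in for the self-heal check: is the entry point present right now?"""
    return (crate_dir / "lib.rs").is_file()


def simulate_race() -> tuple[bool, bool]:
    """Replay the interleaving: job A checks, job B deletes, job A reads."""
    crate_dir = Path(tempfile.mkdtemp()) / "fnv-1.0.7"
    crate_dir.mkdir()
    (crate_dir / "lib.rs").write_text("// crate source\n")

    passed = cache_check_passes(crate_dir)  # job A: check succeeds
    shutil.rmtree(crate_dir)                # job B: removes the extraction
    try:
        (crate_dir / "lib.rs").read_text()  # job A: compile step reads source
        read_ok = True
    except FileNotFoundError:
        read_ok = False
    return passed, read_ok
```

`simulate_race()` returns `(True, False)`: the check passed, yet the later read failed. No check that runs before the compile step can close that window while other writers share the directory.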

FIX (this PR)
─────────────
Mirror the target-dir pattern from ci.yml:55 for the cargo registry. Each
PR (or branch) gets its own registry under /mnt/nvme-raid0/cargo-ci/registry/<pr#>.
Docker auto-creates the leaf dir on first mount; the ci-reaper TTL sweep
(ci-reaper.sh:308) needs a companion infra update (paiml/infra#77) to include
the new /mnt path.

 - Removes: /home/noah/.cargo/registry:/usr/local/cargo/registry
 - Adds:    /mnt/nvme-raid0/cargo-ci/registry/${pr#|ref_name}:/usr/local/cargo/registry
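
A sketch of how the new mount spec is derived, in Python for illustration. It reads `${pr#|ref_name}` as "PR number if there is one, otherwise the ref name", which is an assumption; the function name `registry_volume` is hypothetical.

```python
from typing import Optional

REGISTRY_ROOT = "/mnt/nvme-raid0/cargo-ci/registry"
CONTAINER_REGISTRY = "/usr/local/cargo/registry"


def registry_volume(pr_number: Optional[int], ref_name: str) -> str:
    """Build a host:container bind-mount spec keyed by PR number or ref name."""
    # Assumption: the ref name is used as-is when no PR number is available.
    key = str(pr_number) if pr_number is not None else ref_name
    return f"{REGISTRY_ROOT}/{key}:{CONTAINER_REGISTRY}"
```

For example, `registry_volume(1043, "main")` yields `/mnt/nvme-raid0/cargo-ci/registry/1043:/usr/local/cargo/registry`. Concurrent jobs on different PRs therefore never share a host directory.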

Cost: ~200MB per PR on first run (cargo re-downloads crates). Same cost
profile as the target/ isolation fix, which the fleet already absorbed.
Once cargo-ci/registry/<pr#> warms on run 1, run 2+ hit the cache.

FOLLOW-UP
─────────
paiml/infra#77 tracks:
  - forjar recipe to pre-create /mnt/nvme-raid0/cargo-ci/ owner=noah:noah
  - reaper extension: GC /mnt/nvme-raid0/cargo-ci/registry/<pr#>/src with same TTL
  - once infra lands, drop the ANDON comment above
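
The reaper extension could select candidates roughly as in this Python sketch. The real ci-reaper.sh logic and TTL value are not shown here, so the mtime criterion, the `ttl_seconds` parameter, and the function name `expired_src_dirs` are all assumptions.

```python
import time
from pathlib import Path
from typing import Optional


def expired_src_dirs(root: Path, ttl_seconds: float,
                     now: Optional[float] = None) -> list[Path]:
    """Return <root>/<key>/src dirs whose mtime is older than ttl_seconds."""
    if not root.is_dir():
        return []
    now = time.time() if now is None else now
    expired = []
    for per_pr in sorted(root.iterdir()):
        src = per_pr / "src"
        # Directory mtime only changes when entries are added or removed,
        # so this is a coarse "last extraction" signal, not "last read".
        if src.is_dir() and now - src.stat().st_mtime > ttl_seconds:
            expired.append(src)
    return expired
```

Here `root` would be /mnt/nvme-raid0/cargo-ci/registry. The sketch only lists candidates; deletion and any locking against running jobs are left to the reaper itself.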

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* ci: trigger fresh run to pick up paiml/.github#32 security-job CARGO_HOME fix

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift

Contributor Author

Superseded by #1043 (merged f6b4dff). #1043 fixes the same intel-runner shared-cargo-registry race at the source (per-PR mount under /mnt/nvme-raid0/cargo-ci/registry/<pr#>), mirroring the target-dir pattern at ci.yml:55. The self-heal approach here was a band-aid that itself raced under concurrent load (see the five-whys in paiml/infra#77). Closing; no further action needed on this branch.

@noahgift noahgift closed this Apr 24, 2026
auto-merge was automatically disabled April 24, 2026 11:34

Pull request was closed
