chore(ci): self-heal cargo registry cache against intel-runner race #1025
Closed
noahgift wants to merge 2 commits into
Conversation
Intel runners bind-mount /home/noah/.cargo/registry into every workspace-test container. 16 clean-room runners write the SAME directory simultaneously, and a crate's src/ extraction can be partially removed by a concurrent job while its .crate tarball stays intact, so cargo thinks the crate is cached but fails with `couldn't read .../lib.rs`.

Failure class: PR #987 saw five consecutive workspace-test failures on 2026-04-23 on a rustix-0.38.44 cache miss. This is the same class of bug as the target/ race that disk-guard PR #1001 addressed for a different directory.

Fix: before cargo compile, walk /usr/local/cargo/registry/src/*/*/ (the in-container mount point of the host registry) and remove any extraction missing `.cargo-ok` or `Cargo.toml`. Cargo re-extracts from the .crate tarball on next use (cheap; no network). Idempotent: a no-op when the cache is already consistent.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
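The sweep described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the script from the PR: the function name `heal_registry` and its "print the number of removed extractions" behaviour are assumptions made for the example.

```shell
#!/usr/bin/env bash
# Sketch of the self-heal sweep (illustrative; `heal_registry` is a
# hypothetical name, not taken from the PR).

# Remove every crate extraction under $1 that lacks `.cargo-ok` or
# `Cargo.toml`, then print how many extractions were removed.
heal_registry() {
  local registry_src="${1:-/usr/local/cargo/registry/src}"
  local removed=0
  local crate_dir
  # Layout: <registry_src>/<index-dir>/<crate>-<version>/
  for crate_dir in "$registry_src"/*/*/; do
    [ -d "$crate_dir" ] || continue   # glob matched nothing
    if [ ! -f "$crate_dir/.cargo-ok" ] || [ ! -f "$crate_dir/Cargo.toml" ]; then
      rm -rf -- "$crate_dir"
      removed=$((removed + 1))
    fi
  done
  echo "$removed"
}
```

Calling `heal_registry` with no argument targets the in-container path named above. A second call on an already consistent cache removes nothing, which is the idempotence property the commit message claims.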
The original heuristic (missing .cargo-ok or Cargo.toml) missed a tighter failure mode observed on PR #1025's own workspace-test run: several crate extractions retained both marker files at the root but the `src/` directory itself was gone, so cargo trusted the cache and failed on `couldn't read .../src/lib.rs`.

New check: after the marker/manifest pair passes, also require at least one of src/lib.rs | src/main.rs | lib.rs | main.rs to exist. Flat-layout crates (fnv-1.0.7, macro_rules_attribute-proc_macro-0.2.2) keep lib.rs at the crate root with no src/ dir; the disjunction avoids a false positive on that entire class.

Also fall back to `sudo rm -rf` when the initial rm fails on root-owned artifacts left by container runs (runner user is noah:1000, containers write as root via bind-mount).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
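A sketch of the tightened check and the `sudo` fallback, under the same caveat as before: the function names `extraction_ok` and `remove_extraction` are hypothetical, and the real script in the PR may be structured differently.

```shell
#!/usr/bin/env bash
# Illustrative only: names below are assumptions, not the PR's code.

# Return 0 when the extraction at $1 looks complete, 1 otherwise.
extraction_ok() {
  local dir="$1"
  # Marker/manifest pair must be present first.
  [ -f "$dir/.cargo-ok" ] && [ -f "$dir/Cargo.toml" ] || return 1
  # Then at least one entry point, covering both src/ and flat layouts.
  local entry
  for entry in src/lib.rs src/main.rs lib.rs main.rs; do
    [ -f "$dir/$entry" ] && return 0
  done
  return 1
}

# Remove $1, retrying with sudo if the plain rm fails
# (for example on root-owned files left by a container run).
remove_extraction() {
  rm -rf -- "$1" 2>/dev/null || sudo rm -rf -- "$1"
}
```

A flat-layout crate such as fnv-1.0.7 passes through the `lib.rs` branch of the loop, while an extraction whose `src/` was removed but whose markers survived fails every branch and is flagged for removal.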
3 tasks
noahgift added a commit that referenced this pull request on Apr 24, 2026
… race (ANDON paiml/infra#77) (#1043)

* fix(ci): per-PR cargo registry to break intel-runner concurrent-write race (paiml/infra#77)

ANDON 2026-04-24: aprender 11-PR stack (#1031..#1042) all failing `ci / security` and `workspace-test` with:

    error: couldn't read /home/noah/.cargo/registry/src/<crate>/lib.rs: Permission denied (os error 13)

and the rustix-0.38 equivalent (E0432 unresolved import `libc`/`libc_errno` originating in the `syscall` macro, which the rustix build.rs regenerates from src/ files; missing src/ → macro can't find libc crate → cascading errors).

FIVE WHYS
─────────
1 `ci / security` fails: `cargo install cargo-audit --locked` hits EACCES reading `fnv-1.0.7/lib.rs`.
2 EACCES: the file is missing OR owned by root (docker container creates extractions as root on the bind-mounted host registry).
3 Concurrent writers: 16 self-hosted `intel-clean-room-*` runners bind-mount the SAME /home/noah/.cargo/registry. Cargo extractions, the ci-reaper TTL sweep, and cross-container chown cycles all touch identical paths.
4 Shared by design: ci.yml:49 was authored for throughput. Re-downloading crates per job is ~200MB, so the host registry was shared across all runners. Race class not modeled.
5 Precedent already exists: target/ hit the identical race under concurrent PRs (task #134) and was fixed by per-PR isolation on /mnt/nvme-raid0/targets/aprender-ci/<pr#>. The registry simply never got the same treatment.

ROOT CAUSE
──────────
Shared mutable bind mount + concurrent multi-runner write access ≈ guaranteed race. The existing band-aid (PR #1025 "self-heal cargo registry cache", cargo-ok + Cargo.toml marker check) only runs inside `ci / security` and itself races with concurrent jobs that have already passed the cache check.

FIX (this PR)
─────────────
Mirror the target-dir pattern from ci.yml:55 for the cargo registry. Each PR (or branch) gets its own registry under /mnt/nvme-raid0/cargo-ci/registry/<pr#>. Docker auto-creates the leaf dir on first mount; the ci-reaper TTL sweep (ci-reaper.sh:308) needs a companion infra update (paiml/infra#77) to include the new /mnt path.

- Removes: /home/noah/.cargo/registry:/usr/local/cargo/registry
- Adds: /mnt/nvme-raid0/cargo-ci/registry/${pr#|ref_name}:/usr/local/cargo/registry

Cost: ~200MB per PR on first run (cargo re-downloads crates). Same cost profile as the target/ isolation fix, which the fleet already absorbed. Once cargo-ci/registry/<pr#> warms on run 1, run 2+ hit the cache.

FOLLOW-UP
─────────
paiml/infra#77 tracks:
- forjar recipe to pre-create /mnt/nvme-raid0/cargo-ci/ owner=noah:noah
- reaper extension: GC /mnt/nvme-raid0/cargo-ci/registry/<pr#>/src with same TTL
- once infra lands, drop the ANDON comment above

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* ci: trigger fresh run to pick up paiml/.github#32 security-job CARGO_HOME fix

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
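The per-PR mount selection in the "Adds" line can be illustrated with a small helper. This is a sketch under stated assumptions, not the workflow's actual expression: `registry_mount` is a hypothetical name, and flattening `/` in branch names to `-` is an assumption added here so that a ref name maps to a single leaf directory.

```shell
#!/usr/bin/env bash
# Illustrative only: `registry_mount` is a hypothetical helper.

# Print the docker -v argument for a per-PR (or per-branch) cargo registry.
# $1: PR number (may be empty); $2: ref name, used when there is no PR number.
registry_mount() {
  local pr_number="$1"
  local ref_name="$2"
  local key="${pr_number:-$ref_name}"
  # Assumption: flatten '/' in branch names so the key is one directory level.
  key=$(printf '%s' "$key" | tr '/' '-')
  printf '%s\n' "/mnt/nvme-raid0/cargo-ci/registry/${key}:/usr/local/cargo/registry"
}
```

For example, `docker run -v "$(registry_mount "$PR_NUMBER" "$REF_NAME")" ...` would give each PR its own host directory, where `PR_NUMBER` and `REF_NAME` stand in for whatever variables the CI system exposes. Two PRs then never share a registry path, which removes the concurrent-writer condition from the five-whys analysis instead of repairing its damage after the fact.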
Comment from the PR author (Contributor):
Superseded by #1043 (merged f6b4dff). #1043 fixes the same intel-runner shared-cargo-registry race at the source (per-PR mount under /mnt/nvme-raid0/cargo-ci/registry/<pr#>), mirroring the target-dir pattern at ci.yml:55. The self-heal approach here was a band-aid that itself raced under concurrent load (see the five-whys in paiml/infra#77). Closing; no further action needed on this branch.
auto-merge was automatically disabled April 24, 2026 11:34: Pull request was closed
Summary
- … `couldn't read .../rustix-0.38.44/src/lib.rs` error).
- … `/usr/local/cargo/registry/src/*/*/` and removes any extraction missing `.cargo-ok` or `Cargo.toml`; cargo re-extracts from the intact `.crate` tarball (cheap, no network).

Root cause
Intel runners bind-mount `/home/noah/.cargo/registry` into every `workspace-test` container (ci.yml line 49). 16 clean-room runners write the same directory simultaneously. When one job partially removes a crate's extracted `src/` while the crate's `.crate` tarball remains in `registry/cache/`, cargo trusts the stale state and fails at compile.

This is the same class of bug as the `target/` race that disk-guard PR #1001 mitigated, just in a different shared directory.

Fix
- … (`.cargo-ok` or `Cargo.toml`) is missing. Cargo's extractor writes `.cargo-ok` last, so a missing marker always means a broken extraction.
- … `stat(2)`, sub-second on the intel host.

Test plan
🤖 Generated with Claude Code