CI infra: disk-guard cross-runner race wipes target/ mid-build (ci/coverage persistent flake) #1020

@noahgift

Summary

ci/coverage on self-hosted runners fails mid-compile with No such file or directory (os error 2). Root cause: the [runner-disk-guard] pre-job hook on one runner aggressively prunes target/ dirs belonging to OTHER co-located runners when host disk usage is at or above 85%, deleting build artifacts that a sibling runner's job is still using.

Evidence — PR #1019 (4 consecutive failures)

  • Run 24803346062 job 72595673413 (initial): 3 concurrent runners pruned.
  • Run 24803346062 job 72595673413 (rerun 1): same.
  • Run 24803346062 job 72596255553 (rerun 2): 4 concurrent runners pruned.
  • Run 24805228768 job 72597635971 (fresh commit f60f013): disk-guard fired again.

Typical log signature:

[runner-disk-guard] pre-job: / at 90% ≥ 85% — pruning target/ dirs
[runner-disk-guard] aggressive prune: /home/noah/data/actions-runner-9/_work/aprender/aprender/target
[runner-disk-guard] aggressive prune: /home/noah/data/actions-runner-5/_work/aprender/aprender/target
[runner-disk-guard] aggressive prune: /home/noah/data/actions-runner-14/_work/aprender/aprender/target
[runner-disk-guard] freed approximately 1667976 KiB (mode=aggressive)
...
error: failed to build archive at `/__w/aprender/aprender/target/llvm-cov-target/debug/deps/libarrow_array-*.rlib`: failed to open object file: No such file or directory (os error 2)
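
To confirm from a job log that a prune touched sibling runners, the signature above can be checked mechanically. This is a minimal sketch, assuming the log line format shown above; the function names and the idea of passing in the current runner's directory name are illustrative assumptions, since the log itself does not say which runner executed the hook.

```python
import re

# Matches lines such as:
# [runner-disk-guard] aggressive prune: /home/.../actions-runner-9/_work/.../target
# The pattern is derived from the log excerpt in this issue, not from the hook's source.
PRUNE_RE = re.compile(
    r"\[runner-disk-guard\] aggressive prune: "
    r"(?P<path>\S*/(?P<runner>actions-runner-\d+)/\S*)"
)


def pruned_runners(log_text):
    """Return the runner directory names whose target/ dirs were pruned, in log order."""
    return [m.group("runner") for m in PRUNE_RE.finditer(log_text)]


def sibling_prunes(log_text, own_runner):
    """Return pruned runners other than own_runner (hypothetical parameter)."""
    return [r for r in pruned_runners(log_text) if r != own_runner]
```

Any non-empty result from sibling_prunes means the hook deleted a target/ dir that does not belong to the runner that ran it.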

Impact

Proposed fixes (pick one)

  1. Per-runner disk-guard scope — only prune actions-runner-N's own target dir, never a sibling's.
  2. Active-job awareness — skip prune if any sibling runner has an active job.
  3. Container isolation — if coverage runs in Docker, bind-mount a per-job target volume that's not on the shared host target path.
  4. Pre-job lock — serialize disk-guard across runners via a host-level mutex.
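
Fixes 1 and 4 can be combined in a small pre-job hook. The following is a rough sketch, not the existing disk-guard implementation: it assumes each runner lives in its own actions-runner-N directory with target dirs at _work/<repo>/<repo>/target (as in the log paths above), and the function names, the lock file path, and the threshold parameter are hypothetical. It relies on fcntl, so it is Unix-only.

```python
import fcntl
import shutil
from pathlib import Path


def own_target_dirs(runner_root):
    """List target/ dirs under this runner's own _work tree only.

    Assumed layout: <runner_root>/_work/<repo>/<repo>/target
    """
    return sorted(p for p in Path(runner_root).glob("_work/*/*/target") if p.is_dir())


def prune_own_targets(runner_root, lock_path, threshold_pct=85):
    """Prune only this runner's target/ dirs, serialized by a host-level lock.

    Returns the list of directories that were removed.
    """
    root = Path(runner_root)
    with open(lock_path, "w") as lock:
        # Exclusive advisory lock: blocks while another runner's hook holds it.
        # The lock is released when the file is closed at the end of the block.
        fcntl.flock(lock, fcntl.LOCK_EX)

        usage = shutil.disk_usage(root)
        used_pct = usage.used * 100 / usage.total  # approximation of df's Use%
        if used_pct < threshold_pct:
            return []

        pruned = []
        for target in own_target_dirs(root):
            # Never walks into sibling actions-runner-N directories.
            shutil.rmtree(target, ignore_errors=True)
            pruned.append(target)
        return pruned
```

Because a pre-job hook runs before its own runner's job starts, restricting the prune to runner_root means the hook never removes a directory that an in-flight build is using. The lock only keeps two hooks from measuring and pruning at the same time; it does not by itself protect an active sibling build.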

Workaround

Push a trivial commit to retrigger the job. The disk-guard does not fire if disk usage is under 85% at that moment, so the rerun sometimes passes. This is non-deterministic.

Related

  • Task chore(deps): Bump axum from 0.7.9 to 0.8.8 #134 (completed) added per-PR target dir + CARGO_INCREMENTAL=0 — fixed one class of cross-PR target collisions but not this cross-runner race.
  • Paiml shared-sovereign-ci workflow: paiml/.github/.github/workflows/sovereign-ci.yml
