chore(infra): self-hosted runner disk-guard automation by noahgift · Pull Request #1001 · paiml/aprender

noahgift · 2026-04-22T05:50:01Z

Summary

Prevents the / = 100% full failure class that took all 16 intel-clean-room-* runners offline on 2026-04-22 (diagnosed via gh api /orgs/paiml/actions/runners showing 16/16 offline, root cause: / at 3.5T/3.6T → runner _diag logs unwritable → GitHub marks runners offline).
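The diagnosis described above can be reproduced roughly as follows. This is a sketch rather than a command taken from the PR: the helper names `list_runner_status` and `count_offline` are hypothetical, and it assumes an authenticated `gh` CLI with permission to read the organization's self-hosted runners.

```shell
#!/bin/sh
# Hypothetical helpers; not part of the PR's scripts.

ORG="${ORG:-paiml}"

# Print one "name status" line per self-hosted runner in the organization.
# Requires the gh CLI; --jq filters the JSON response of the REST endpoint.
list_runner_status() {
    gh api "/orgs/${ORG}/actions/runners" --paginate \
        --jq '.runners[] | "\(.name) \(.status)"'
}

# Read "name status" lines on stdin and print "offline/total"
# for runners whose name starts with the prefix given as $1.
count_offline() {
    awk -v prefix="$1" '
        index($1, prefix) == 1 { total++; if ($2 == "offline") off++ }
        END { printf "%d/%d\n", off, total }
    '
}
```

On the affected fleet, `list_runner_status | count_offline intel-clean-room-` would have printed `16/16` during the outage.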

Two-layer defence installed on intel as of this commit:

- Pre-job hook (runner-pre-job.sh, wired via each runner's existing ACTIONS_RUNNER_HOOK_JOB_STARTED): when / ≥ 85%, aggressively prune _work/*/target/ before the job starts. Also retains the prior root-owned-file chown logic.
- Nightly systemd timer (runner-disk-guard.timer → .service): at 04:00 local, prune any _work/*/target/ untouched for ≥ 7 days.
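A minimal sketch of the two layers, under stated assumptions. It is not the actual runner-pre-job.sh: the function names, the `THRESHOLD_PCT` and `RUNNER_GLOB` variables, and the use of directory mtime as the meaning of "untouched" are all illustrative. The pre-job prune shown here is the unconditional form described in this summary, so it does not check whether a sibling runner is mid-build.

```shell
#!/bin/sh
# Hypothetical sketch of the two disk-guard layers; names and defaults are assumptions.

THRESHOLD_PCT="${THRESHOLD_PCT:-85}"                           # prune at or above this usage
RUNNER_GLOB="${RUNNER_GLOB:-/home/noah/data/actions-runner*}"  # assumed runner layout

# Extract the Capacity column (without the % sign) from `df -P` output on stdin.
parse_df_pct() {
    awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Print the used-space percentage of the filesystem holding $1.
disk_used_pct() {
    df -P "$1" | parse_df_pct
}

# Succeed when the percentage in $1 is at or above the threshold.
over_threshold() {
    [ "$1" -ge "$THRESHOLD_PCT" ]
}

# Layer 1: before a job starts, prune every target/ dir if / is too full.
pre_job_prune() {
    pct="$(disk_used_pct /)"
    if over_threshold "$pct"; then
        logger -t runner-disk-guard "/ at ${pct}%, pruning target dirs" 2>/dev/null || true
        for t in $RUNNER_GLOB/_work/*/target; do
            if [ -d "$t" ]; then rm -rf "$t"; fi
        done
    fi
}

# Layer 2: nightly, prune target/ dirs whose mtime is at least 7 days old.
nightly_prune() {
    for t in $RUNNER_GLOB/_work/*/target; do
        [ -d "$t" ] || continue
        # find truncates age to whole days, so "+6" means 7 days or more.
        if [ -n "$(find "$t" -maxdepth 0 -mtime +6)" ]; then
            rm -rf "$t"
        fi
    done
}
```

A script like this would call `pre_job_prune` from the hook named by ACTIONS_RUNNER_HOOK_JOB_STARTED and `nightly_prune` from the service that the timer triggers.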

Emergency recovery (2026-04-22) freed 1.3 TB on intel by stopping all 16 runners, rm -rf _work, restarting. Without this automation the next fill-up would repeat that outage.

Test plan

- 16/16 runners online on GitHub after recovery + script install
- systemctl list-timers shows runner-disk-guard.timer next fire at 04:15 UTC (04:00 + randomized delay)
- Each runner .env already references /usr/local/bin/runner-pre-job.sh (verified on all 16)
- Next job on any runner will exercise the pre-job hook (observable via logger -t runner-disk-guard in syslog)

🤖 Generated with Claude Code

Addresses 2026-04-22 outage where all 16 intel-clean-room runners went offline because / on intel hit 100% (3.5T/3.6T). Runner diag logs couldn't be written, so GitHub marked runners offline.

Two layers of defence:
- pre-job hook: aggressive target/ prune when disk >= 85%
- nightly timer: prune target/ older than 7 days

Scripts are runner-host-agnostic; install path and deployment recipe are in scripts/runner-infra/README.md.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Previously `--pre-job` unconditionally rm -rf'd every target/ across the shared host whenever disk hit 85%, including target/ dirs on sibling runners that were mid-compile. This produced `rlib: No such file or directory` errors for hours whenever multiple runners ran simultaneously (observed across main CI and every cascade PR on 2026-04-22/23).

Root cause: unconditional prune has no notion of "owned-by-active-job".

Fix: before pruning each target/ dir, check for a Runner.Worker process whose cwd is inside that runner's tree. Skip if so.
- `runner_has_active_worker` greps /proc/<pid>/cwd of Runner.Worker PIDs and matches them against the target/'s owning runner dir
- `runner_dir_of` normalizes a target path to its /home/noah/data/actions-runner-N root via sed
- On a shared host with ≥2 runners active, only idle runners' target/ dirs are pruned; actives keep their own target/ for the duration of their job

Acceptable degenerate case: if ALL 16 runners happen to be active, prune frees 0 bytes. That's strictly better than clobbering in-progress builds.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
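The active-worker check described in this commit might look roughly like the sketch below. The function names `runner_has_active_worker` and `runner_dir_of` come from the commit message, but their bodies here are assumed, and `worker_pids`, `prune_idle_targets`, `RUNNER_BASE`, and `PROC_ROOT` are hypothetical additions that keep the logic testable without real runner processes.

```shell
#!/bin/sh
# Hypothetical sketch; only the two function names are taken from the commit message.

RUNNER_BASE="${RUNNER_BASE:-/home/noah/data}"   # parent of the actions-runner-N dirs
PROC_ROOT="${PROC_ROOT:-/proc}"                 # overridable for testing

# Map a path inside a runner tree to that runner's root directory, e.g.
#   $RUNNER_BASE/actions-runner-3/_work/aprender/target -> $RUNNER_BASE/actions-runner-3
runner_dir_of() {
    printf '%s\n' "$1" | sed -E "s#^(${RUNNER_BASE}/actions-runner[^/]*).*#\1#"
}

# Print the PIDs of running Runner.Worker processes (pgrep -f matches the command line).
worker_pids() {
    pgrep -f 'Runner\.Worker' 2>/dev/null || true
}

# Succeed if some Runner.Worker process has its cwd at or below the runner dir in $1.
runner_has_active_worker() {
    runner_dir="$1"
    for pid in $(worker_pids); do
        cwd="$(readlink "$PROC_ROOT/$pid/cwd" 2>/dev/null)" || continue
        case "$cwd" in
            "$runner_dir"|"$runner_dir"/*) return 0 ;;
        esac
    done
    return 1
}

# Prune target/ dirs only for runners that have no active worker.
prune_idle_targets() {
    for t in "$RUNNER_BASE"/actions-runner*/_work/*/target; do
        [ -d "$t" ] || continue
        if runner_has_active_worker "$(runner_dir_of "$t")"; then
            continue    # a job is running in this runner's tree; leave its target/ alone
        fi
        rm -rf "$t"
    done
}
```

Matching on `"$runner_dir"/*` rather than a bare prefix keeps `actions-runner-1` from being confused with `actions-runner-10`. If every runner is busy, the loop removes nothing, which is the degenerate case the commit message accepts.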

The disk-guard added in #1001 walked only /home/noah/data/actions-runner*/_work/*/target/ (runner-workspace target dirs totalling ~75G across 8 runners). The actual runner-disk-fill source that took intel offline on 2026-04-23 was /mnt/nvme-raid0/targets/aprender-ci/*: per-PR bind-mount target dirs from ci.yml's task-#134 isolation, holding 1.9T including a 359G orphan `debug/` dir from pre-isolation era. Disk-guard never touched them.

Adds new BIND_MOUNT_ROOTS (default `/mnt/nvme-raid0/targets/aprender-ci`) and a prune_bind_mount_target_roots() helper:
- Always prunes `debug/` subdir (orphan, no current workflow bind-mounts it).
- Prunes PR# subdirs stale past a minute threshold (nightly: STALE_DAYS×24×60 min; pre-job: 60-min floor so fresh in-flight dirs survive full-disk recovery).
- Preserves `main` (push-to-main CI reuses it).

Space-separated BIND_MOUNT_ROOTS env var lets the same script cover sibling fleets (sovereign-ci-paiml-mcp-agent-toolkit etc.) without code changes.

Deployed to intel 2026-04-23T12:58Z; nightly dry-run confirmed no unexpected prune candidates under the new path.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
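The three pruning rules for bind-mount roots can be sketched as below. `BIND_MOUNT_ROOTS`, `STALE_DAYS`, and `prune_bind_mount_target_roots` are named in the commit message; the function body, the `nightly_stale_minutes` helper, the `PRE_JOB_STALE_MINUTES` variable, the `STALE_DAYS` default of 7, and the use of directory mtime with `find -mmin` as the staleness test are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical sketch of the bind-mount pruning rules; the body is assumed.

BIND_MOUNT_ROOTS="${BIND_MOUNT_ROOTS:-/mnt/nvme-raid0/targets/aprender-ci}"
STALE_DAYS="${STALE_DAYS:-7}"       # assumed default, matching the 7-day nightly rule
PRE_JOB_STALE_MINUTES=60            # the 60-minute floor used by the pre-job path

# Nightly threshold in minutes: STALE_DAYS x 24 x 60.
nightly_stale_minutes() {
    echo $((STALE_DAYS * 24 * 60))
}

# $1 = staleness threshold in minutes.
prune_bind_mount_target_roots() {
    stale_min="$1"
    for root in $BIND_MOUNT_ROOTS; do        # space-separated list, split on purpose
        [ -d "$root" ] || continue
        # Rule 1: debug/ is an orphan and is always removed.
        rm -rf "$root/debug"
        # Rules 2 and 3: remove direct subdirs older than the threshold, never main.
        find "$root" -mindepth 1 -maxdepth 1 -type d ! -name main \
            -mmin "+$stale_min" -exec rm -rf {} +
    done
}
```

The nightly path would call `prune_bind_mount_target_roots "$(nightly_stale_minutes)"` and the pre-job path `prune_bind_mount_target_roots "$PRE_JOB_STALE_MINUTES"`. A directory's mtime changes only when entries directly inside it are added or removed, so a real script may need a different freshness signal.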

noahgift enabled auto-merge (squash) April 22, 2026 09:12

noahgift mentioned this pull request Apr 23, 2026

CI infra: disk-guard cross-runner race wipes target/ mid-build (ci/coverage persistent flake) #1020

Closed

noahgift force-pushed the chore/runner-disk-guard-automation branch 2 times, most recently from a5e9d35 to 9c3f6b3 April 23, 2026 06:07

noahgift and others added 2 commits April 23, 2026 08:57

noahgift force-pushed the chore/runner-disk-guard-automation branch from 9c3f6b3 to bbfe1c5 April 23, 2026 06:57

noahgift merged commit 51656d6 into main Apr 23, 2026
10 checks passed

noahgift deleted the chore/runner-disk-guard-automation branch April 23, 2026 07:16

This was referenced Apr 23, 2026

chore(ci): self-heal cargo registry cache against intel-runner race #1025

Closed

chore(infra): extend disk-guard to cover bind-mount target roots #1026

Merged

noahgift merged 2 commits into main from chore/runner-disk-guard-automation
