chore(infra): self-hosted runner disk-guard automation #1001

Merged: noahgift merged 2 commits into main from chore/runner-disk-guard-automation on Apr 23, 2026

@noahgift

Summary

Prevents the failure class in which / fills to 100%. That failure took all 16 intel-clean-room-* runners offline on 2026-04-22. Diagnosis: gh api /orgs/paiml/actions/runners showed 16/16 offline. Root cause: with / at 3.5T/3.6T, the runner _diag logs became unwritable, so GitHub marked the runners offline.

Two-layer defence installed on intel as of this commit:

  1. Pre-job hook (runner-pre-job.sh, wired via each runner's existing ACTIONS_RUNNER_HOOK_JOB_STARTED): when / ≥ 85%, aggressively prune _work/*/target/ before the job starts. Also retains the prior root-owned-file chown logic.
  2. Nightly systemd timer (runner-disk-guard.timer / runner-disk-guard.service): at 04:00 local, prune any _work/*/target/ untouched for ≥ 7 days.
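
The threshold check in the pre-job hook can be sketched roughly as follows. This is a minimal illustration, not the contents of runner-pre-job.sh: the function names `disk_used_pct` and `over_threshold` are hypothetical, and the sketch only reports what it would do instead of deleting anything.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the pre-job threshold check; names are illustrative
# and are not taken from runner-pre-job.sh.
THRESHOLD_PCT=85

# Print the used-space percentage of a mount point as a bare integer.
# `df -P` uses the POSIX output format, so the data is always on line 2.
disk_used_pct() {
    df -P "$1" | awk 'NR == 2 { gsub("%", "", $5); print $5 }'
}

# Succeed (exit 0) when usage is at or above the threshold.
over_threshold() {
    [ "$1" -ge "$2" ]
}

pct="$(disk_used_pct /)"
if over_threshold "$pct" "$THRESHOLD_PCT"; then
    echo "/ at ${pct}%: would prune _work/*/target/ here"
else
    echo "/ at ${pct}%: below ${THRESHOLD_PCT}%, nothing to do"
fi
```

Keeping the comparison in its own function makes the 85% boundary easy to test without filling a disk.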

Emergency recovery (2026-04-22) freed 1.3 TB on intel by stopping all 16 runners, running rm -rf _work, and restarting them. Without this automation, the next fill-up would repeat that outage.

Test plan

  • 16/16 runners online on GitHub after recovery + script install
  • systemctl list-timers shows runner-disk-guard.timer next fire at 04:15 UTC (04:00 + randomized delay)
  • Each runner .env already references /usr/local/bin/runner-pre-job.sh (verified on all 16)
  • Next job on any runner will exercise the pre-job hook (observable via logger -t runner-disk-guard in syslog)

🤖 Generated with Claude Code

noahgift enabled auto-merge (squash) April 22, 2026 09:12
noahgift force-pushed the chore/runner-disk-guard-automation branch 2 times, most recently from a5e9d35 to 9c3f6b3 April 23, 2026 06:07
noahgift and others added 2 commits April 23, 2026 08:57

Addresses 2026-04-22 outage where all 16 intel-clean-room runners went
offline because / on intel hit 100% (3.5T/3.6T). Runner diag logs
couldn't be written, so GitHub marked runners offline.

Two layers of defence:
- pre-job hook: aggressive target/ prune when disk >= 85%
- nightly timer: prune target/ older than 7 days
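
A rough sketch of how the nightly age-based selection could look, assuming directory mtime is an acceptable proxy for "untouched". The function name `stale_targets` and the dry-run behaviour are assumptions for illustration; the real script is not shown in this PR page. Note that a directory's mtime only changes when entries are added or removed directly inside it, so this is an approximation.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: list target/ dirs not modified for at least STALE_DAYS
# days. It prints candidates only (dry run) and never deletes anything.
STALE_DAYS=7

# Print each <runner_root>/_work/<repo>/target whose mtime is >= <days> days old.
# find's `-mtime +N` matches ages strictly greater than N whole days, so
# N = days - 1 selects directories that are at least <days> days old.
stale_targets() {
    local runner_root="$1" days="$2" dir
    for dir in "$runner_root"/_work/*/target; do
        [ -d "$dir" ] || continue
        find "$dir" -maxdepth 0 -mtime +"$((days - 1))" -print
    done
}

# Example (dry run): stale_targets /path/to/one/runner "$STALE_DAYS"
```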

Scripts are runner-host-agnostic; the install path and deployment recipe are in
scripts/runner-infra/README.md.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Previously `--pre-job` unconditionally rm -rf'd every target/ across the shared
host whenever disk hit 85%, including target/ dirs on sibling runners that
were mid-compile. This produced `rlib: No such file or directory` errors for
hours whenever multiple runners ran simultaneously (observed across main CI
and every cascade PR on 2026-04-22/23).

Root cause: unconditional prune has no notion of "owned-by-active-job".
Fix: before pruning each target/ dir, check for a Runner.Worker process whose
cwd is inside that runner's tree. Skip if so.

- `runner_has_active_worker` greps /proc/<pid>/cwd of Runner.Worker PIDs and
  matches them against the target/'s owning runner dir
- `runner_dir_of` normalizes a target path to its /home/noah/data/actions-runner-N
  root via sed
- On a shared host with ≥2 runners active, only idle runners' target/ dirs are
  pruned; actives keep their own target/ for the duration of their job
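
The two helpers described above might look something like the following. This is a reconstruction from the commit message only: the exact sed expression and the use of `pgrep -f` to find Runner.Worker PIDs are assumptions, and the committed script may do this differently.

```shell
#!/usr/bin/env bash
# Hypothetical reconstruction of the two helpers named in the commit message.

# Map .../actions-runner-N/_work/<repo>/target to .../actions-runner-N.
runner_dir_of() {
    printf '%s\n' "$1" | sed -E 's#(/actions-runner[^/]*)/.*$#\1#'
}

# Succeed if any Runner.Worker process has its cwd inside the given runner dir.
# Assumption: worker PIDs are discoverable with `pgrep -f`.
runner_has_active_worker() {
    local runner_dir="$1" pid cwd
    for pid in $(pgrep -f 'Runner\.Worker' 2>/dev/null); do
        cwd="$(readlink "/proc/$pid/cwd" 2>/dev/null)" || continue
        case "$cwd" in
            "$runner_dir"|"$runner_dir"/*) return 0 ;;
        esac
    done
    return 1
}

# Example: skip pruning when the owning runner is busy.
# target=/home/noah/data/actions-runner-3/_work/aprender/target
# runner_has_active_worker "$(runner_dir_of "$target")" || echo "would prune $target"
```

Matching on `"$runner_dir"/*` rather than a plain prefix avoids treating actions-runner-1 as the owner of paths under actions-runner-10.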

Acceptable degenerate case: if ALL 16 runners happen to be active, prune frees
0 bytes. That's strictly better than clobbering in-progress builds.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

noahgift force-pushed the chore/runner-disk-guard-automation branch from 9c3f6b3 to bbfe1c5 April 23, 2026 06:57
noahgift merged commit 51656d6 into main Apr 23, 2026
10 checks passed
noahgift deleted the chore/runner-disk-guard-automation branch April 23, 2026 07:16
noahgift added a commit that referenced this pull request May 14, 2026

The disk-guard added in #1001 walked only /home/noah/data/actions-runner*/_work/*/target/:
runner-workspace target dirs totalling ~75G across 8 runners. The actual source of the disk
fill that took intel offline on 2026-04-23 was /mnt/nvme-raid0/targets/aprender-ci/*:
per-PR bind-mount target dirs from ci.yml's task-#134 isolation, holding 1.9T, including a
359G orphan `debug/` dir from the pre-isolation era. Disk-guard never touched them.

Adds a new BIND_MOUNT_ROOTS env var (default `/mnt/nvme-raid0/targets/aprender-ci`) and a
prune_bind_mount_target_roots() helper:

- Always prunes `debug/` subdir (orphan, no current workflow bind-mounts it).
- Prunes per-PR subdirs older than a staleness threshold in minutes (nightly:
  STALE_DAYS×24×60 min; pre-job: 60-min floor, so fresh in-flight dirs survive
  full-disk recovery).
- Preserves `main` (push-to-main CI reuses it).
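
The three rules above can be sketched as a dry-run selector. The function names `stale_minutes_for` and `bind_mount_candidates` are hypothetical and are not the body of prune_bind_mount_target_roots(); the sketch prints candidates only and uses directory mtime as the staleness signal, which is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the selection rules; prints candidates only (dry run).
STALE_DAYS=7

# Staleness threshold in minutes for each mode.
stale_minutes_for() {
    case "$1" in
        nightly) echo $(( STALE_DAYS * 24 * 60 )) ;;
        pre-job) echo 60 ;;
        *) return 1 ;;
    esac
}

# Print prune candidates directly under one bind-mount root.
bind_mount_candidates() {
    local root="$1" minutes="$2" dir name
    for dir in "$root"/*/; do
        [ -d "$dir" ] || continue
        dir="${dir%/}"
        name="${dir##*/}"
        case "$name" in
            main)  continue ;;                # reused by push-to-main CI
            debug) printf '%s\n' "$dir" ;;    # orphan, always a candidate
            *)     find "$dir" -maxdepth 0 -mmin +"$minutes" -print ;;
        esac
    done
}

# Example (dry run):
# bind_mount_candidates /mnt/nvme-raid0/targets/aprender-ci "$(stale_minutes_for nightly)"
```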

Space-separated BIND_MOUNT_ROOTS env var lets the same script cover sibling fleets
(sovereign-ci-paiml-mcp-agent-toolkit etc.) without code changes.
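
Iterating a space-separated root list might look like this. It is a sketch under the assumption that root paths contain no whitespace; `each_bind_mount_root` is a hypothetical helper name, and only the default path comes from the commit message.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: walk a space-separated BIND_MOUNT_ROOTS list.
# Assumes root paths contain no whitespace.
BIND_MOUNT_ROOTS="${BIND_MOUNT_ROOTS:-/mnt/nvme-raid0/targets/aprender-ci}"

# Print one root per line, relying on intentional word splitting.
each_bind_mount_root() {
    local root
    for root in $BIND_MOUNT_ROOTS; do
        printf '%s\n' "$root"
    done
}

each_bind_mount_root | while read -r root; do
    echo "would scan bind-mount root: $root"
done
```

Adding a second fleet is then a matter of exporting a longer BIND_MOUNT_ROOTS value before the script runs.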

Deployed to intel 2026-04-23T12:58Z; nightly dry-run confirmed no unexpected prune
candidates under the new path.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>